What Is Bioinformatics?


Biological data is proliferating rapidly. Public databases such as GenBank and the Protein 
Data Bank have been growing exponentially for some time now. With the advent of the 
World Wide Web and fast Internet connections, the data contained in these databases and 
a great many special-purpose programs can be accessed quickly, easily, and cheaply from 
any location in the world. As a consequence, computer-based tools now play an 
increasingly critical role in the advancement of biological research. 


Bioinformatics, a rapidly evolving discipline, is the application of computational tools 
and techniques to the management and analysis of biological data. The term 
bioinformatics is relatively new, and as defined here, it encroaches on such terms as 
"computational biology" and others. The use of computers in biology research predates 
the term bioinformatics by many years. For example, the determination of 3D protein 
structure from X-ray crystallographic data has long relied on computer analysis. In this 
book I refer to the use of computers in biological research as bioinformatics. It's 
important to be aware, however, that others may make different distinctions between the 
terms. In particular, bioinformatics is often the term used when referring to the data and 
the techniques used in large-scale sequencing and analysis of entire genomes, such as C. 
elegans, Arabidopsis, and Homo sapiens. 


What Bioinformatics Can Do 


Here's a short example of bioinformatics in action. Let's say you have discovered a very
interesting segment of mouse DNA and you suspect it may hold a clue to the
development of fatal brain tumors in humans. After sequencing the DNA, you perform a
search of GenBank and other data sources using web-based sequence alignment tools
such as BLAST. Although you find a few related sequences, you don't get a direct match 
or any information that indicates a link to the brain tumors you suspect exist. You know 
that the public genetic databases are growing daily and rapidly. You would like to 
perform your searches every day, comparing the results to the previous searches, to see if 
anything new appears in the databases. But this could take an hour or two each day! 
Luckily, you know Perl. With a day's work, you write a program (using the Bioperl 
module among other things) that automatically conducts a daily BLAST search of 
GenBank for your DNA sequence, compares the results with the previous day's results,
and sends you email if there has been any change. This program is so useful that you start 
running it for other sequences as well, and your colleagues also start using it. Within a 
few months, your day's worth of work has saved many weeks of work for your 
community. This example is taken from real life. There are now existing programs you 
can use for this purpose, even web sites where you can submit your DNA sequence and 
your email address, and they'll do all the work for you! 
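

The comparison step at the heart of such a program can be sketched in a few lines of Perl. This is only an illustration, not the program described above: the subroutine name new_hits and the identifiers in the two lists are invented for the example. A real version would take its hit lists from actual BLAST output and send email instead of printing.

```perl
use strict;
use warnings;

# Hypothetical helper: return the identifiers that appear in today's
# hit list but not in yesterday's. Both arguments are array references.
sub new_hits {
    my ($previous, $current) = @_;

    # Record yesterday's identifiers in a hash for fast lookup
    my %seen = map { $_ => 1 } @$previous;

    # Keep only today's identifiers that were not seen yesterday
    return grep { !$seen{$_} } @$current;
}

# Made-up identifiers standing in for the hits of two daily searches
my @yesterday = ('hit_A', 'hit_B');
my @today     = ('hit_A', 'hit_B', 'hit_C');

my @added = new_hits(\@yesterday, \@today);

if (@added) {
    print "New hits: @added\n";    # a real program would send email here
} else {
    print "No change\n";
}
```

Storing the previous day's identifiers in a hash keeps the comparison quick even when the hit lists grow long.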


This is only a small example of what happens when you apply the power of computation 
to a biological problem. This is bioinformatics. 


About This Book 


This book is a tutorial for biologists on how to program, and is designed for beginning 
programmers. The examples and exercises, with only a few exceptions, use biological data.
The book's goal is twofold: it teaches programming skills and applies them to interesting 
biological areas. 


I want to get you up and programming as quickly and painlessly as possible. I aim for 
simplicity of explanation, not completeness of coverage. I don't always strictly define the 
programming concepts, because formal definitions can be distracting. 


The Perl language makes it possible to start writing real programs quickly. As you 
continue reading this book and the online Perl documentation, you'll fill in the details, 
learn better ways of doing things, and improve your understanding of programming 
concepts. 


Depending on your style of learning, you can approach this material in different ways. 
One way, as the King gravely said to Alice, is to "Begin at the beginning and go on till 
you come to the end: then stop." (This line from Alice in Wonderland is often used as a 
whimsical definition of an algorithm.) The material is organized to be read in this fashion, 
as a narrative. 


Another approach is to get the programs into your computer, run them, see what they do, 
and perhaps try to alter this or that in the program to see what effect your changes have. 
This may be combined with a quick skim of the text of the chapter. This is a common 
approach used by programmers when learning a new language. Basically, you learn by 
imitation, looking at actual programs.


Anyone wishing to learn Perl programming for bioinformatics should try the exercises
found at the end of most chapters. They are given in approximate order of difficulty, and 
some of the higher-numbered exercises are fairly challenging and may be appropriate for 
classroom projects. Because there's more than one way to do things in Perl, there is no 
one correct answer to an exercise. If you're a beginning programmer, and you manage to 
solve an exercise in any way whatsoever, you've succeeded at that exercise. My 
suggested solutions to the exercises may be found at 


http://www.oreilly.com/catalog/begperlbio. 


I hope that the material in this book will serve not only as a practical tutorial, but also as a 
first step to a research program if you decide that bioinformatics is a promising research 
direction in itself or an adjunct to ongoing investigations. 


Who This Book Is For 


This book is a practical introduction to programming for biologists.


Programming skills are now in strong demand in biology research and development. 
Historically, programming has not often been viewed as a critical skill for biologists at 
the bench. However, recent trends in biology have made computer analysis of large 
amounts of data central to many research programs. This book is intended as a hands-on, 
one-volume course for the busy biologist to acquire practical bioinformatics 
programming abilities. So, if you are a biologist who needs to learn programming, this 
book is for you. Its goal is to teach you how to write useful and practical bioinformatics 
programs as quickly and as painlessly as possible. 


This book introduces programming as an important new laboratory skill; it presents a 
programming tutorial that includes a collection of "protocols," or programming 
techniques, that can be immediately useful in the lab. But its primary purpose is to teach 
programming, not to build a comprehensive toolkit. 


There is a real blending of skills and approaches between the laboratory bench and the 
computer program. Many people do indeed find themselves shifting from running gels to 
writing Perl in the course of a day—or a career—in biology research. Of course, 
programming is its own discipline with its own methods and terminology, and so must be 
approached on its own terms. But there is cross-fertilization going on (if you'll pardon the
metaphor) between the two disciplines.


This book's exercises are of varying difficulty for those using it as a class textbook or for 
self study. (Almost) all examples and exercises are based on real biological problems, 
and this book will give you a good introduction to the most common bioinformatics 
programming problems and the most common computer-based biological data. 


This book's web site, http://www.oreilly.com/catalog/begperlbio, includes all the 
program code in the book for convenient download, including the exercises and solutions, 
plus errata and other information.[1]


[1] Program code, or simply code, means a computer program: the actual Perl language
commands a programmer writes in a file.


Why Should I Learn to Program? 


Since many researchers who describe their work as "bioinformatics" don't program at all, 
but rather, use programs written by others, it's tempting to ask, "Do I really need to learn 
programming to do bioinformatics?" At one level, the answer is no, you don't. You can 
accomplish quite a bit using existing tools, and there are books and documentation 
available to help you learn those tools. But at another, higher level, the answer to the 
question changes. What happens when you want to do something a preexisting tool 
doesn't do? What happens when you can't find a tool to accomplish a particular task, and 
you can't find someone to write it for you? 


At that point, you need to learn to program. And even if you still rely mainly on existing 
programs and tools, it can be worthwhile to learn enough to write small programs. Small 
programs can be incredibly useful. For example, with a bit of practice, you can learn to 
write programs that run other programs and spare yourself hours sitting in front of the 
computer doing things by hand. 
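

As a rough sketch of what "a program that runs other programs" can look like, the following Perl starts another program and collects whatever it prints. The helper name run_and_capture is invented for this example. To keep the sketch self-contained, the "other program" is simply the Perl interpreter itself (available in the special variable $^X) rather than a real bioinformatics tool. The list form of a piped open used here may not be available on every platform with older versions of Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: run another program and return what it prints.
# The command is given as a list: program name followed by its arguments.
sub run_and_capture {
    my @command = @_;

    # Open a pipe that reads the other program's output
    open(my $pipe, '-|', @command)
        or die "Cannot run '@command': $!";

    my @lines = <$pipe>;
    close($pipe);

    return join('', @lines);
}

# For illustration only: ask the Perl interpreter ($^X) to print a short
# DNA string, standing in for some other program you might want to run.
my $output = run_and_capture($^X, '-e', 'print "ACGT\n"');

print "The other program printed: $output";
```

Once the output is in a variable, the rest of the program can search it, reformat it, or pass it on to yet another program.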


Many scientists start out writing small programs and find that they really like 
programming. As a programmer, you never need to worry about finding the right tools 
for your needs; you can write them yourself. This book will get you started. 


Structure of This Book 


There are thirteen chapters and two appendixes in this book. The following provides a 
brief introduction: 


Chapter 1 
This chapter covers some key concepts in molecular biology, as well as how 
biology and computer science fit together. 

Chapter 2 
This chapter shows you how to get Perl up and running on your computer. 


Chapter 3 


Chapter 3 provides an overview of how programmers accomplish their jobs.
Some of the most important practical strategies good programmers use are 
explained, and where to find answers to questions that arise while you are 
programming is carefully laid out. These ideas are made concrete by brief 
narrative case studies that show how programmers, given a problem, find its 
solution. 


Chapter 4 


In Chapter 4 you start writing Perl programs with DNA and proteins. The 
programs transcribe DNA to RNA, concatenate sequences, make the reverse 
complement of DNA, read sequence data from files, and more.


Chapter 5


This chapter continues demonstrating the basics of the Perl language with 
programs that search for motifs in DNA or protein, interact with users at the 
keyboard, write data to files, use loops and conditional tests, use regular 
expressions, and operate on strings and arrays. 


Chapter 6 


This chapter extends the basic knowledge of Perl in two main directions: 
subroutines, which are an important way to structure programs, and the use 
of the Perl debugger, which can examine in detail a running Perl program. 


Chapter 7 


Genetic mutations, fundamental to biology, are modelled as random events 
using the random number generator in Perl. This chapter uses random 
numbers to generate DNA sequence data sets, and to repeatedly mutate DNA 
sequence. Loops, subroutines, and lexical scoping are also discussed. 


Chapter 8 


This chapter shows how to translate DNA to proteins, using the genetic code.
It also covers a good bit more of the Perl programming language, such as the
hash data type, sorted and unsorted arrays, binary search, relational
databases, and DBM, and how to handle FASTA-formatted sequence data.


Chapter 9


This chapter contains an introduction to Perl regular expressions. The main 
focus of the chapter is the development of a program to calculate a restriction 
map for a DNA sequence. 


Chapter 10 


The Genetic Sequence Data Bank (GenBank) is central to modern biology and 
bioinformatics. In this chapter, you learn how to write programs to extract 
information from GenBank files and libraries. You will also make a database to 
create your own rapid access lookups on a GenBank library. 


Chapter 11 


This chapter develops a program that can parse Protein Data Bank (PDB) files. 
Some interesting Perl techniques are encountered while doing so, such as 
finding and iterating over lots of files and controlling other bioinformatics 
programs from a Perl program. 


Chapter 12 


Chapter 12 develops some code to parse a BLAST output file. Also mentioned 
are the Bioperl project and its BLAST parser, and some additional ways to format 
output in Perl. 


Chapter 13 
Chapter 13 looks ahead to topics beyond the scope of this book. 


Appendix A 


Collected here are resources for Perl and for bioinformatics programming, 
such as books and Internet sites.


Appendix B


This is a summary of the parts of Perl covered in this book, plus a little more.


Conventions Used in This Book 


The following conventions are used in this book: 


Italic 


Used for commands, filenames, directory names, variables, modules, URLs, 
and for the first use of a term 


Constant width 


Used in code examples and to show the output of commands 


[Note icon] This icon designates a note, which is an important aside to the
nearby text.


[Warning icon] This icon designates a warning relating to the nearby text.





Comments and Questions 


Please address comments and questions concerning this book to the publisher: 


O'Reilly & Associates, Inc. 

1005 Gravenstein Highway North 

Sebastopol, CA 95472 

(800) 998-9938 (in the United States or Canada) 
(707) 829-0515 (international/local) 

(707) 829-0104 (fax) 


There is a web page for this book, which lists errata, examples, and any additional
information. You can access this page at: 


http://www.oreilly.com/catalog/begperlbio 





To comment or ask technical questions about this book, send email to: 


bookquestions@oreilly.com 





For more information about books, conferences, Resource Centers, and the O'Reilly 
Network, see the O'Reilly web site at:


http://www.oreilly.com




Chapter 1. Biology and Computer Science


One of the most exciting things about being involved in computer programming and 
biology is that both fields are rich in new techniques and results. 


Of course, biology is an old science, but many of the most interesting directions in 
biological research are based on recent techniques and ideas. The modern science of 
genetics, which has earned a prominent place in modern biology, is just about 100 years 
old, dating from the widespread acknowledgement of Mendel's work. The elucidation of 
the structure of deoxyribonucleic acid (DNA) and the first protein structure are about 50 
years old, and the polymerase chain reaction (PCR) technique of cloning DNA is almost 
20 years old. The last decade saw the launching and completion of the Human Genome 
Project that revealed the totality of human genes and much more. Today, we're in a 
golden age of biological research—a point in human history of great medical, scientific, 
and philosophical importance. 


Computer science is relatively new. Algorithms have been around since ancient times 
(Euclid), and the interest in computing machinery is also antique (Pascal's mechanical 
calculator, for instance, or Babbage's steam-driven inventions of the 19th century). But 
programming was really born about 50 years ago, at the same time as construction of the 
first large, programmable, digital electronic computers (such as the ENIAC). Programming has
grown very rapidly to the present day. The Internet is about 20 years old, as are personal 
computers; the Web is about 10 years old. Today, our communications, transportation, 
agricultural, financial, government, business, artistic, and of course, scientific endeavors 
are closely tied to computers and their programming. 


This rapid and recent growth gives the field of computer programming a certain 
excitement and requires that its professional practitioners keep on their toes. In a way, 
programming represents procedural knowledge—the knowledge of how to do things— 
and one way to look at the importance of computers in our society and our history is to 
see the enormous growth in procedural knowledge that the use of computers has 
occasioned. We're also seeing the concepts of computation and algorithm being adopted 
widely, for instance, in the arts and in the law, and of course in the sciences. The 
computer has become the ruling metaphor for explaining things in general. Certainly, it's 
tempting to think of a cell's molecular biology in terms of a special kind of computing 
machinery. 


Similarly, the remarkable discoveries in biology have found an echo in computer science. 
There are evolutionary programs, neural networks, simulated annealing, and more. The 
exchange of ideas and metaphors between the fields of biology and computer science is, 
in itself, a spur to discovery (although the dangers of using an improper metaphor are also 
real). 


1.1 The Organization of DNA 


It's necessary to review some of the very basic concepts and terminology of DNA and
proteins at this point. This review is for the benefit of the nonbiologist; if you're a
biologist you can skip the next two sections.


DNA is a polymer composed of four molecules, usually called bases or nucleotides. Their
names and one-letter abbreviations are adenine (A), cytosine (C), guanine (G), and
thymine (T).[1] (See Chapter 4 for more about how DNA is represented as computer data.)
The bases join end to end to form a single strand of DNA.


[1] These names come from where they were originally found: the glands, the cell, guano, and
the thymus.


In the cell, DNA usually appears in a double-stranded form, with two strands wrapped 
around each other in the famous double helix shape. The two strands of the double helix 
have matching bases, known as the base pairs. An A on one strand is always opposite a T 
on the other strand, and a G is always paired with a C. 


There is also an orientation to the strands. One end of a nucleotide is called the 5' (five 
prime) end, and the other is called the 3' (three prime) end. When nucleotides join to 
make a single strand of DNA, they always connect the 5' end of one to the 3' end of the 
other. Furthermore, when the cell uses the DNA, as in transcribing it to RNA, it does so
base by base in the 5' to 3' direction. So, when DNA is written, it's written left to
right on the page, corresponding to the 5' to 3' orientation of the bases. An encoded gene
can appear on either strand, so it's important to look at both strands when searching or 
analyzing DNA. 


When two strands are joined in a double helix (as in Figure 1-1), the two strands have 
opposite orientations. That is, the 5' to 3' orientation of one strand runs in the opposite
direction to the 5' to 3' orientation of the other strand. So at each end of the double helix,
one strand has a 3' end; the other has a 5' end. 


Figure 1-1. Two strands of DNA 


[Figure: two paired strands of DNA, one labeled "Orientation: read left to right" and the other "Orientation: read right to left"]


Because the base pairs are always matched A-T and C-G and the orientations of the
strands are the reverse of each other, the term reverse complement describes the
relationship of the bases of the two strands. It's "reverse" because the orientations are
reversed, and "complement" because the bases always pair to their complementary bases,
A to T and C to G.


Given these facts and a single strand of DNA, it's easy to figure out what the matching strand
would be in the double helix. Simply change all bases to their complements: A to T, T to
A, C to G, and G to C. Then, since DNA is written in the 5' to 3' direction, after
complementing the DNA, write it in reverse.
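

The recipe just described can be sketched in Perl as follows. This is one possible way to write it, and the subroutine name reverse_complement and the sample sequence are chosen only for this illustration. The sketch reverses the strand first and then complements it, which gives the same result as complementing first and reversing afterward.

```perl
use strict;
use warnings;

# Return the reverse complement of a DNA string
sub reverse_complement {
    my ($dna) = @_;

    # Write the strand in reverse order
    my $revcom = reverse $dna;

    # Swap each base for its complement (uppercase and lowercase)
    $revcom =~ tr/ACGTacgt/TGCAtgca/;

    return $revcom;
}

# A made-up sequence for demonstration
my $dna = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

print "Strand:             $dna\n";
print "Reverse complement: ", reverse_complement($dna), "\n";
```

Taking the reverse complement of the result gives back the original strand, which is a handy check that the subroutine behaves as expected.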


GenBank, the Genetic Sequence Data Bank (http://www.ncbi.nlm.nih.gov), contains
most known sequence data. We'll take a closer look at GenBank in Chapter 10. 


1.2 The Organization of Proteins 


Proteins are somewhat similar to DNA. They are also polymers, long strings made up of 
a small number of simple molecules. As DNA is composed of four nucleotides, so 
proteins are composed of 20 amino acids. These amino acids may occur in any order. See 
Table 4-2 for the names and one- and three-letter abbreviations for the amino acids. 


Amino acids are composed of an amino group and a carboxyl group. They form a 
chemical bond, called a peptide bond, between the amino group and the carboxyl group 
of adjacent amino acids. Each of the 20 amino acids has a different sidechain, which 
protrudes from the backbone. The chemical properties of the sidechains are important in 
determining the properties of the protein. 


Proteins usually have a more complex 3D structure than DNA. The protein backbone has a
great deal of rotational freedom (around the bonds on either side of each rigid peptide
bond), which allows proteins to form many 3D structures.
Instead of DNA's double helix, proteins tend to fold up in a variety of different shapes 
and are composed of one or more strands of amino acids assembled together.[2] The
sequence of amino acids along the strand is called the primary structure. The coiling in on 
itself into local structures such as helices, beta-strands, and turns, is called the secondary 
structure. The final foldings and assemblies are called the tertiary and quaternary 
structure of proteins (see Chapter 11). 


[2] I try to avoid most of the potentially confusing biology in this text in order to concentrate
on learning Perl, but I can't help mentioning at this point that DNA also has a more complex
3D structure. It can appear in one-stranded, two-stranded, and three-stranded forms, and it
is also coiled and recoiled into a small space during most of the life of the cell.


There is more primary sequence data available than secondary or higher structural data. 
In fact, a great deal of primary protein sequence data is available (since it is relatively 
easy to identify primary protein sequence from DNA, of which a great deal has been 
sequenced). 


The Protein Data Bank (PDB) contains structural information about thousands of proteins, 
the accumulated knowledge of decades of work. We'll look at the PDB in Chapter 11,
but you may want to get a head start by visiting the PDB web site
(http://www.rcsb.org/pdb/) to become familiar with this essential bioinformatics 
resource. 





1.3 In Silico 


Recently, the new term in silico has become a common reference to biological studies 
carried out in the computer, joining the traditional terms in vivo and in vitro to describe 
the location of experimental studies. 


For nonbiologists, in vitro means "in glass," that is, in the test tube; in vivo means "in 
life," that is, in a living organism. The term in silico stems from the fact that most 
computer chips are made primarily of silicon. Personally, I prefer a term such as in 
algorithmo, since there are plenty of ways to compute that don't involve silicon, such as 
the intriguing processes of DNA computing, quantum computing, optical computing, and 
more. 


The large amount of biological data available online has brought biological research to a 
situation somewhat similar to physics and astronomy. Those sciences have found that 
experiments with modern equipment produce huge amounts of data, and the computer isn't
only invaluable but necessary for exploring the data. Indeed, it's become possible to 
simulate experiments entirely in the computer. For instance, an early use of computer 
simulation in physics was in modeling the acoustics of a concert hall and then 
experimenting with the results by changing the design of the hall—clearly a much 
cheaper way to experiment than by building dozens of concert halls! 


A similar trend has been occurring in biology since computers were first invented, but 
this trend has sharply accelerated in recent years with the Human Genome Project and the 
sequencing of the DNA of many organisms. The experimental data that has to be 
collected, searched, and analyzed is often far too large for the unaided biologist, who is 
now forced to rely on computers to manage the information. 


Beyond the storage and retrieval of biological data, it's now possible to study living 
systems through computer simulation. There are standard and accepted studies done 
routinely on computers that access the genes of humans and of several other organisms. 
When the sequence of some DNA is determined, it can be stored in the computer, and 
programs can be written to identify restriction sites, perform restriction digests and create 
restriction maps (see Chapter 9). Similarly, gene-finding programs can take sequenced 
DNA and identify putative exons and introns. (Not perfectly, as of this writing, and 
results differ for different organisms.) Models of cellular processes exist in which it is
possible to study, for example, the effect of a change in the regulation of a gene.
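

To make the idea of locating restriction sites concrete, here is a minimal Perl sketch that reports where a recognition site occurs in a sequence. It is not the restriction-map program developed in Chapter 9: the subroutine name find_sites and the sample sequence are invented for illustration, and only a single fixed site (GAATTC, the recognition site of the enzyme EcoRI) is searched for.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based start positions of every
# occurrence of a recognition site in a DNA sequence.
sub find_sites {
    my ($dna, $site) = @_;
    my @positions;

    # Nothing sensible to report for an empty site
    return () if $site eq '';

    # Compare in uppercase so that case differences don't matter
    $dna  = uc $dna;
    $site = uc $site;

    # index returns -1 when there are no more occurrences
    my $offset = index($dna, $site);
    while ($offset != -1) {
        push @positions, $offset + 1;              # report 1-based positions
        $offset = index($dna, $site, $offset + 1); # continue after this hit
    }

    return @positions;
}

# A made-up sequence containing the EcoRI site twice
my $dna   = 'AAGAATTCGGCCTTGAATTCAA';
my @sites = find_sites($dna, 'GAATTC');

print "EcoRI sites start at positions: @sites\n";
```

Real recognition sites are often written with ambiguity codes and can appear on either strand, which is where the regular expressions discussed later in the book become useful.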


Today, microarray technology (incorporating glass slides spotted with thousands of 
samples that can be probed, usually with the aid of robotics) can assess the levels of 
expression of thousands of genes with one laboratory run. Computers are helping to 
unravel the complex interactions between genes. We hope to find, for example, all sets of 
genes related by virtue of their protein products as part of a biochemical pathway in the 
cell. Microarrays generate a large volume of data. This data needs to be stored, compared 
with other experimental data, and analyzed on the computer. 


On my first day as a programmer at Bell Labs Research, my boss told me that his 
simulations could now be computed so fast—overnight—that it was creating a problem 
for him. There wasn't enough time to think about the last simulation! Nevertheless, and
despite all the attendant headaches and pitfalls of computers, their use to simulate 
experiments is proving to be beneficial in biology. 


1.4 Limits to Computation


Some of the most interesting results of computer science demonstrate certain limits to
human knowledge. There are many open problems in biology, and one hopes that 
applying more computer power to them may help solve them. But this isn't always 
possible, because some problems can be shown to be unsolvable; that is, they can't be 
solved by any program. Furthermore, some problems may be solvable, but as the size of 
the problem grows, they get practically impossible to solve. These problems are called
intractable; the NP-complete problems, for which no efficient solution is known, are the
best-known examples. Even a million computers, each a million times more
powerful than the most powerful computer existing today, could take perhaps a billion 
years to compute the answer to such an intractable problem. 
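

A little arithmetic gives a feeling for how quickly brute force runs out of steam as a problem grows. The Perl sketch below does not demonstrate NP-completeness itself; it only shows how fast a search space can grow, using the number of possible DNA sequences of a given length (four choices per base). The subroutine name count_sequences and the rate of one billion sequences examined per second are assumptions made purely for illustration.

```perl
use strict;
use warnings;

# Number of distinct DNA sequences of a given length: 4 choices per base
sub count_sequences {
    my ($length) = @_;
    return 4 ** $length;
}

# Assumed, purely illustrative rate: one billion sequences per second
my $per_second       = 1e9;
my $seconds_per_year = 60 * 60 * 24 * 365;

for my $length (10, 20, 40, 80) {
    my $count = count_sequences($length);
    my $years = $count / $per_second / $seconds_per_year;

    printf "length %3d: %.3g sequences, about %.3g years to examine them all\n",
        $length, $count, $years;
}
```

Each added base multiplies the number of sequences by four, so doubling the length squares the amount of work, and faster hardware helps far less than one might hope.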


Now the chances are that you're not going to get stung by an unsolvable or intractable 
problem. It can happen, but it's relatively rare. I mention them more as a point of interest 
than as a practical concern to the beginning programmer. But as you attempt more 
complex programs down the road, these limitations, and especially the intractable nature 
of several biological problems, can have a practical impact on your programming efforts.




Chapter 2. Getting Started with Perl


Perl is a popular programming language that's extensively used in areas such as 
bioinformatics and web programming. Perl has become popular with biologists because 
it's so well-suited to several bioinformatics tasks. 


Perl is also an application, just like any other application you might install on your 
computer. It is available (at no cost) and runs on all the operating systems found in the 
average biology lab (Unix and Linux, Macintosh, Windows, VMS, and more).[1] The Perl
application on your computer takes a Perl language program (such as one of the programs 
you will write in this book), translates it into instructions the computer can understand, 
and runs (or "executes") it. 


[1] An operating system manages the running of programs and other basic services that a
computer provides, such as how files are stored. 


So, the word Perl refers both to the language in which you will write programs and to the
application on your computer that runs those programs. You can always tell from context 
which meaning is being used. 


Every computer language such as Perl needs to have a translator application (called an 
interpreter or compiler) that can turn programs into instructions the computer can actually 
run. So the Perl application is often referred to as the Perl interpreter, and it includes a 
Perl compiler as well. You will often see Perl programs referred to as Perl scripts or Perl 
code. The terms program, application, script, and executable are somewhat 
interchangeable. I refer to them as "programs" in this book. 


2.1 A Low and Long Learning Curve 


A nice thing about Perl is that you can learn to write programs fairly quickly; in essence, 
Perl has a low learning curve. This means you can get started easily, without having 
to master a large body of information before writing useful programs. 


Perl provides different styles of writing programs. Since these are beyond the scope of 
this book, I won't go into details, except to mention the popular style called imperative 
programming that you'll learn in this book. The equally popular style called object- 
oriented programming is also well-supported in Perl. Other styles of programming 
include functional programming and logic programming. 


Although you can get started quickly, learning all of Perl will certainly take a while, if
that's your goal. Most people learn the basics, as presented in this book, and then learn 
additional topics as needed. 

Let's get a few elementary definitions out of the way: 

What is a computer program?


It's a set of instructions written in a particular programming language that can be
read by the computer. A program can be as simple as the following Perl language 
program to print some DNA sequence data onto the computer screen: 

print 'ACCTGGTAACCCGGAGATTCCAGCT'; 

The Perl language programs are written and saved in files, which are ways of 
saving any kind of data (not only programs) on a computer. Files are organized 
hierarchically in groups called folders on Macintosh or Windows systems or 
directories in Unix or Linux systems. The terms folder and directory will be used 
interchangeably. 


What is a programming language? 


It's a carefully defined set of rules for how to write computer programs. By 
learning the rules of the language, you can write programs that will run on your 
computer. Programming languages are similar to our own natural, or spoken 
languages, such as English, but are more strictly defined and specific to certain 
computer systems. With a little bit of training, it's not difficult to read or write 
computer programs. In this book you'll write in Perl; there are many other 
programming languages. 

A program that a programmer writes is also called source code, or just source or 
code. The source code has to be turned into machine language, a special language 
the computer can run. It's hard to write or read a machine language program 
because it's all binary numbers; it's often called a binary executable. You use the 
Perl interpreter (or compiler) to turn a Perl program into a running program, as 
you'll see later in this chapter. 


What is a computer? 


Well, ... 
Okay, silly question. It's that machine you buy in computer stores. But actually, it's 
important to have a clear idea of what kind of machine a computer is. Essentially, a 
computer is a machine that can run many different programs. This is the fundamental 
flexibility and adaptability that makes the computer such a useful and general-purpose 
tool. It's programmable; you will learn how to program it using the Perl programming 
language. 


2.2 Perl's Benefits 


The following sections illustrate some of Perl's strong points. 


2.2.1 Ease of Programming 


Computer languages differ in which things they make easy. By "easy" I mean easy for a 
programmer to program. Perl has certain features that simplify several common 
bioinformatics tasks. It can deal with information in ASCII text files or flat files, which 
are exactly the kinds of files in which much important biological data appears, in the 
GenBank and PDB databases, among others. (See the discussion of ASCII in Chapter 
4; GenBank and PDB are the subjects of Chapter 10 and Chapter 11.) Perl makes it 
easy to process and manipulate long sequences such as DNA and proteins. Perl makes it 
convenient to write a program that controls one or more other programs. As a final 
example, Perl is used to put biology research labs, and their results, on their own dynamic 
web sites. Perl does all this and more. 
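
To give a first taste of the sequence handling mentioned above, here is a minimal sketch that computes the reverse complement of a DNA string. The subroutine name revcom is my own choice for this illustration, not something built into Perl, and the sketch assumes the sequence contains only the letters A, C, G, and T (in either case).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# revcom is a hypothetical helper name chosen for this sketch.
# It returns the reverse complement of a DNA string.
sub revcom {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;    # reverse the order of the bases
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # replace each base with its complement
    return $rc;
}

my $dna = 'ACCTGGTAACCCGGAGATTCCAGCT';
print "Original:           $dna\n";
print "Reverse complement: ", revcom($dna), "\n";
```

The whole job takes two short statements inside the subroutine, which is the kind of brevity that makes Perl attractive for sequence work.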


Although Perl is a language that's remarkably suited to bioinformatics, it isn't the only 
choice nor is it always the best choice. Other programming languages such as C and Java 
are also used in bioinformatics. The choice of language depends on the problem to be 
programmed, the skills of the programmers, and the available system. 


2.2.2 Rapid Prototyping 


Another important benefit of using Perl for biological research is the speed with which a 
programmer can write a typical Perl program (referred to as rapid prototyping). Many 
problems can be solved in far fewer lines of Perl code than in C or Java. This has been 
important to its success in research. In a research environment there are frequent needs 
for programs that do something new, that are needed only once or occasionally, or that 
need to be frequently modified. In Perl, you can often toss such a program off in a few 
minutes or a few hours' work, and the research can proceed. This rapid prototyping ability 
is often a key consideration when choosing Perl for a job. It is common to find 
programmers familiar with both Perl and C who claim that Perl is five to ten times faster 
to program in than C. The difference can be critical in the typical understaffed research 
lab. 


2.2.3 Portability, Speed, and Program Maintenance 


Portability means how many types of computer systems the language can run on. Perl 
has no problems there, as it's available for virtually all modern computers found in 
biology labs. If you write a DNA analyzer in Perl on your Mac, then move it to a 
Windows computer, you'll find it usually runs as is or with only minor retrofitting. 


Speed means the speed with which the program runs. Here Perl is pretty good but not 
the best. For speed of execution, the usual language of choice is C. A program written in 
C typically runs two or more times faster than the comparable Perl program. (There are 
ways of speeding up Perl with compilers and such, but still...) 


In many organizations, programs are first written in Perl, and then only the programs that 
absolutely need to have maximum speed are rewritten in C. The fact is, maximum speed 
is only occasionally an important consideration. 


Programming is relatively expensive to do: it takes time and skilled personnel. It's 
labor-intensive. On the other hand, computers and computer time (often called CPU time after 
the central processing unit) are relatively inexpensive. Most desktop computers sit idle 
for a large part of the day, anyway. So it's usually best to let the computer do the work, 
and save the programmer's time. Unless your program absolutely must run in, say, four 
seconds instead of ten seconds, you're okay with Perl. 


Program maintenance is the general activity of keeping everything working: such 
activities as adding features to a program, extending it to handle more types of input, 
porting it to run on other computer systems, fixing bugs, and so forth. Programs take a 
certain amount of time, effort and cost to write, but successful programs end up costing 
more to maintain than they did to write in the first place. It's important to write in a 
language, and in a style, that makes maintenance relatively easy, and Perl allows you to 
do so. (You can write obscure, hard-to-maintain code in Perl, as in other languages, but 
I'll give you pointers on how to make your code easy for other programmers to read.) 


2.2.4 Versions of Perl 


Perl, like almost all popular software, has gone through much growth and change over the 
course of its nearly 15-year life. The authors—Larry Wall and a large group of cohorts— 
publish new versions periodically. These new versions have been carefully designed to 
support most programs written under old versions, but occasionally some major new 
features are added that don't work with older versions of Perl. 


This book assumes you have Perl Version 5 or higher installed. If you have Perl installed 
on your computer, it's likely Perl 5, but it's best to check. On a Unix or Linux system, or 
from an MS-DOS or MacOS X command window, the perl -v command displays the 
version number, in my case, Version 5.6.1. The number 5.6.1 is "bigger" than 5; that 
means it's okay. If you get a smaller number (very likely 4.036), you have to install a 
recent version of Perl to enable the majority of programs in this book to run as shown. 
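
You can also ask Perl itself for its version from inside a program. The following is a minimal sketch, assuming only that the special variable $] holds the interpreter's version as a number (5.6.1, for example, appears as 5.006001). The helper name version_ok is hypothetical and exists only for this illustration. Note that a very old interpreter may refuse to compile the script at all, which is an equally clear signal that you need to upgrade.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# version_ok is a hypothetical helper name used only in this sketch.
# It returns 1 if the given version number is at least 5, and 0 otherwise.
sub version_ok {
    my ($version) = @_;
    return $version >= 5 ? 1 : 0;
}

# $] is Perl's built-in variable holding the interpreter's numeric version.
if (version_ok($])) {
    print "Perl ", $], " is recent enough for this book.\n";
} else {
    print "Perl ", $], " is too old; please install Perl 5 or later.\n";
}
```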


What about future versions? Perl is always evolving, and Perl Version 6 is on the horizon. 
Will the code in this book still work in Perl 6? The answer is yes. Although Perl 6 is 
going to add some new things to the language, it should have no trouble with the Perl 5 
code in this book. 


2.3 Installing Perl on Your Computer 


The following sections provide pointers for installing Perl on the most common types of 
computer systems. 


2.3.1 Perl May Already Be Installed! 


Many computers—especially Unix and Linux computers—come with Perl already 
installed. (Note that Unix and Linux are essentially the same kind of operating system; 
Linux is a clone, or functional copy, of a Unix system.) So first check to see if Perl is 
already there. On Unix and Linux, type the following at a command prompt: 


$ perl -v 

If Perl is already installed, you'll see a message like the one I get on my Linux machine: 

This is perl, v5.6.1 built for i686-linux 

Copyright 1987-2001, Larry Wall 

Perl may be copied only under the terms of either the Artistic License or the 
GNU General Public License, which may be found in the Perl 5 source kit. 

Complete documentation for Perl, including FAQ lists, should be found on 
this system using 'man perl' or 'perldoc perl'. If you have access to the 
Internet, point your browser at http://www.perl.com/, the Perl Home Page. 


If Perl isn't installed, you'll get a message like this: 
perl: command not found 


If you get this message, and you're on a shared Unix system at a university or business, 
be sure to check with the system administrator, because Perl may indeed be installed, but 
your environment may not be set to find it. (Or, the system administrator may say, "You 
need Perl? Okay, I'll install it for you.") 


On Windows or Macintosh, look at the program menus, or use the find program to 
search for perl. You can also try typing perl -v at an MS-DOS command window or 
at a shell window on MacOS X. (Note that MacOS X is a Unix system!) 


2.3.2 No Internet Access? 


If you don't have Internet access, you can take your computer to a friend who has access 
and connect long enough to install Perl. You can also use a Zip drive or burn a CD from a 
friend's computer to bring the Perl software to your computer. There are commercial 
shrink-wrapped CDs of Perl available from several sources (ask at your local software 
store), and several books, such as O'Reilly's Perl Resource Kits, include CDs with Perl. 


Apart from installing Perl, you don't need Internet access for everything in this book. If 
you want to do the exercises while commuting on the train, or whatever, it can certainly 
be done. Apart from installing Perl, the main use of the Internet for this book is to 
download its examples from the book's web site without having to type them; to 
download and try the exercises; to explore biological data from various biological 
databases; and to access Perl documentation, if it's not installed on your machine. 


Know that if you want to do bioinformatics, the Internet is a practical necessity. You can 
learn programming fundamentals from this book without an Internet connection, but you 
will need Internet access to download bioinformatics software and data. 


2.3.3 Downloading 


Perl is an application, so downloading and installing it on your computer is pretty much 
the same as installing any other application. 


The web site that serves as a central jumping off point for all things Perl is 
http://www.perl.com/. The main page has a Downloads clickable button that guides 
you to everything you need to install Perl on your computer. At the Downloads page, 
there's a Getting Help link and other links. So even if the information in this book 
becomes outdated, you can visit the Perl site and find all you need to install Perl. 


Downloading and installing Perl is usually quite easy; in fact, the majority of the time it's 
perfectly painless. However, sometimes you may have to put some effort into getting it to 
work. If you're new at programming, and you run into difficulties, you should ask for 
help from a professional computer programmer, administrator, teacher, or someone in 
your lab who already programs in Perl. 


So, in a nutshell, here are the basic steps for installing Perl on your computer: 

Check to see if Perl is already installed; if so, check that the version is at least Perl 5. 

Get Internet access and go to the Perl home page at http://www.perl.com/. 

Go to the Downloads page and determine which distribution of Perl to download. 

Download the correct Perl distribution. 

Install the distribution on your computer. 


2.3.4 Binary Versus Source Code 


When downloading from the http://www.perl.com site, you need to choose between 
binary or source-code distributions of Perl. The best choice for installing Perl on your 
computer is to get an already made binary version of the program, because it's the easiest 
to install. However, if no binary is available, or if you want to control the various options 
of your Perl installation, you can get the source code for Perl, which is itself written in 
the C programming language. You then compile it using a C compiler. But try to find a 
binary for your particular computer's operating system; compiling from source code can 
be complicated for beginners. 


2.3.5 Installation 

The next sections provide specific installation instructions for specific platforms. 

2.3.5.1 Unix and Linux 

If Perl isn't installed on your Unix or Linux machine, first try to find a binary to install. 


At the Downloads page of http://www.perl.com, you'll see the subheading Binary 
Distributions. Select Unix or Linux, and then see if your particular flavor of operating 
system has a binary available. Several versions are available, and the web-site 
instructions should be enough to get Perl installed once you've downloaded the binary. 
Most versions of Linux maintain up-to-date Perl binaries on their web sites. For instance, 
if you have a Red Hat Linux system, you need to identify which version of the system 
you have (by typing uname -a) and then get the appropriate rpm file to download and 
install. Red Hat has an rpm for Perl that Red Hat Linux users can install by typing: 

rpm -Uvh perl.rpm 


(the actual name of the perl.rpm file varies). 


If no binary version of Perl is available for your flavor of Unix or Linux, you must 
compile Perl from its source code. In this case, starting from the Perl web page, click on 
the Downloads button and then select Source Code Distribution. The source code has an 
INSTALL file with instructions that guide you through the process of downloading the 
source code, installing it on your system, compiling the source code into a binary, and 
finally installing the binary. 


As mentioned previously, compiling from source code is a considerably longer process 
than installing an already made binary, and requires a bit more reading of instructions, 
but it usually works quite well. You will need a C compiler on your computer to install 
from source code. Nowadays, some Unix systems ship without a complete C compiler. 
Linux will always have the free C compiler called gcc installed, and you can also install 
gcc on any Unix (or Windows, or Mac) system that lacks a C compiler. 


2.3.5.2 Macintosh 


The MacPerl installation steps are clearly explained on the MacPerl web page, 
http://www.macperl.com/ (which you can also get to from the Perl web page and 
its Downloads button). Here's a very brief overview. 


From the MacPerl page, click on Get MacPerl, and follow the directions to download the 
application. It will appear on your desktop. Double-click it to unstuff it. If you don't have 
Aladdin Stuffit Expander (most Macs already do), this won't work, and you'll have to go 
to http://www.aladdinsys.com to download and install Stuffit. 


MacPerl can be installed as a standalone application under the MacOS Finder or as a tool 
under the Macintosh Programmer's Workbench; you will probably want the standalone 
application. Perl Version 5 is available for MacOS 7.0 and later. Details about which Perl 
version is available for your particular hardware and MacOS version are available at the 
MacPerl web page. 


2.3.5.3 Windows 


Several binaries for different Windows versions are available. Since Windows is closely 
coupled with Intel 32-bit chips, these binaries are often called Wintel or Win32 binaries. 
The current standard Perl distribution is ActivePerl from ActiveState, at 
http://www.activestate.com/ActivePerl/, where you can find complete 
installation directions. You can also get to ActivePerl via the Downloads button from the 
Perl web site. Under the subheading Binary Distributions, go to Perl for Win32, and then 
click on the ActivePerl site. 


From the ActiveState web site's ActivePerl page, click the Downloads button. You can 
then download the Windows-Intel binary. Note that installing it requires a program called 
Windows Installer, which is available at ActivePerl if it's not already on your computer. 


2.4 How to Run Perl Programs 


The details of how to run Perl vary depending on your operating system. The instructions 
that come with your Perl installation contain all you need to know. I'll give short 
summaries here, just enough to get you started. 


2.4.1 Unix or Linux 


On Unix or Linux, you usually run Perl programs from the command line. If you're in the 
same directory as the program, you can run a Perl program in a file called this_program 
by typing perl this_program. If you're not in the same directory, you may have to give 
the pathname of the program, for example: 

perl /usr/local/bin/this_program 


Usually, you set the first line of this_program to have the correct pathname for Perl on 
your system, because different machines may have installed Perl in different directories. 


On my computer, I use the following as the first line of my Perl programs: 
#!/usr/bin/perl 





You can type which perl to find the pathname where Perl is installed on your system. 

You can make the program executable using the chmod program; for instance, you can type: 

chmod 755 this_program 

If you've set the first line correctly and used chmod, you can just type the name of the 
Perl program to run it. So, if you're in the same directory as the program, you can 
type ./this_program. If the program is in a directory that's included in your $PATH or 
$path variable, you can type this_program.[2] 


[2] $PATH is the variable used for the sh, bash, and ksh shells; $path is used for csh and tcsh. 


If your Perl program doesn't run, the error messages you get from the shell in the 
command window may be confusing. For instance, the bash shell on my Linux system 
gives the error message: 

bash: ./my_program: No such file or directory 

in two cases: if there really is no program called my_program in the current directory or 
if the first line of my_program has incorrectly given the location of Perl. Watch for that, 
especially when running programs from CPAN (see Appendix A), which may have 
different pathnames for Perl embedded in their first lines. Also, if you type my_program, 
you may get this error message: 

bash: my_program: command not found 

which means that the operating system can't find the program. But it's there in your 
current directory! The problem is probably that your $PATH or $path variable doesn't 
include the current directory, and so the system isn't even looking in the current directory 
for the program. In this case, change the $PATH or $path variable (depending on which 
shell you're using), or just type ./my_program instead of my_program. 
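
If you're unsure which directories your system searches, a short Perl program can list them. This is only a sketch: it relies on the standard File::Spec module, whose path method returns the directories named in the PATH environment variable, and the helper name path_has_dir is hypothetical. It checks only for a literal "." entry, so a PATH that names your current directory by its full pathname is reported as not including it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# path_has_dir is a hypothetical helper name for this sketch.
# It returns how many times $dir appears literally among the listed directories.
sub path_has_dir {
    my ($dir, @dirs) = @_;
    return scalar(grep { $_ eq $dir } @dirs);
}

# File::Spec->path returns the directories in the PATH environment variable.
my @dirs = File::Spec->path;

print "Directories searched for programs:\n";
print "  $_\n" foreach @dirs;

if (path_has_dir('.', @dirs)) {
    print "The current directory (.) is searched.\n";
} else {
    print "The current directory (.) is not listed; type ./my_program instead.\n";
}
```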


2.4.2 Macs 


On Macs, the recommended way to save Perl programs is as "droplets"; the MacPerl 
documentation gives the simple instructions. Basically, you open the Perl program with 
the MacPerl application and then choose Save As and select the Type option Droplet. 


You can drag and drop a file onto a droplet in order to use the file as input (via the @ARGV 
array—see the discussion in Chapter 6). 
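
As a preview of how a program receives filenames, whether they are dropped onto a droplet or typed on a command line, here is a small sketch that reports the number of lines in each file named in @ARGV. The helper name count_lines is my own invention for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_lines is a hypothetical helper for this sketch: it opens the named
# file and returns the number of lines it contains.
sub count_lines {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    my $count = 0;
    $count++ while <$fh>;    # read one line at a time until end of file
    close $fh;
    return $count;
}

# @ARGV holds the names supplied to the program when it was started.
foreach my $filename (@ARGV) {
    print "$filename: ", count_lines($filename), " lines\n";
}
```

On Unix or Linux you might save this as a file and run it as perl count_lines_program some_file.txt, where both names are placeholders for your own files.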


The new MacOS X is a Unix system on which you have the option of running Perl 
programs from the command line as described earlier for Unix and Linux systems. 


2.4.3 Windows 


On Windows systems, it's usual to associate the filename extension .pl with Perl 
programs. This is done as part of the Perl installation process, which modifies the registry 
settings to include this file association. You can then launch this_program.pl by typing 
this_program in an MS-DOS command window or by typing perl this_program.pl. 
Windows has a PATH variable specifying folders in which the system looks for programs, 
and this is modified by the Perl installation process to include the path to the folder for 
the Perl application, usually c:\perl. If you're trying to run a Perl program that isn't 
installed in a folder known to the PATH variable, you can type the complete pathname to 
the program, for instance perl c:\windows\desktop\my_program.pl. 


2.5 Text Editors 


Now that you've set up your computer and installed Perl, you need to select and learn the 
basics of a text editor. A text editor is used to type documents, such as programs, and to 
save the contents of those documents into files. So to write a Perl program, you need to 
use a text editor. This can be a medium-sized learning job if you have never used an 
editor before, although some text editors are easy to learn. Here are some examples of the 
most popular editors, arranged by operating-system type: 


Unix or Linux 


vi and emacs are complex (but very good) editors. pico, xedit, and several others 
(nedit, gedit, kedit) are easy to use and simple to learn but less powerful. There is 
also a free, Microsoft Word-compatible editor included in StarOffice (but be sure 
to save your files as ASCII or text-only). 


Macintosh 


The built-in editor that comes with MacPerl is fine. There is also a nice 
commercial editor called BBEdit that is optimized for Perl, as well as a 
freeware version called BBEdit Lite. You can also use the Alpha shareware 
editor or Microsoft Word (be sure to save as ASCII text only). 


Windows 


Notepad works satisfactorily and may already be familiar; Microsoft Word is also 
usable, but always save as ASCII or text-only. Emacs on Windows is highly 
recommended for Perl programming on Windows-based computers, but it's a little 
complicated to learn. There are many other editors as well; I use a free version of 
the Unix editor vi called vim that has been ported to Windows. 


Many other text editors are available. Most computers come with a choice of several 
editors. (Many programmers try their hand at writing an editor or extending an already 
existing editor at some point in their careers, so the choices are truly legion.) 


Some editors are very simple to learn and use. Others have a huge variety of features, 
their own instruction books and discussion groups and web sites and so on, and can take 
quite a while to learn. If you're a new programmer, pick an easy one and save yourself the 
headache. Later, if you feel adventurous, you can graduate to a fancier editor with 
features that can speed your work. Not sure what is available on your computer? Ask for 
help from a programmer or another user, or consult the documentation that came with 
your computer system. 


2.6 Finding Help 


Make sure you have the necessary documentation. If you installed Perl as outlined earlier, 
documentation is installed as part of the general Perl installation, and the instructions that 
come with your Perl distribution explain how to get the documentation. There is also 
excellent online documentation; look for it at the Perl home page. 


Programming resources are places to look for answers to programming questions. Perl 
resources are essential to doing Perl programming. Check out Appendix A to learn where 
to find resources such as books, online documentation, working programs, newsgroups, 
archives, journals, and conferences. 


As you get involved in programming, you will learn which are the most important books, 
web sites, Internet newsgroups and their searchable archives, local gurus (experts in the 
subject at hand), and program documentation. The documentation includes programming 
manuals (printed or online) and lists of frequently asked questions (FAQs). 


Most languages have a standard document set that includes the whole story about the 
language definition and use. Perl's is included with the program as the online manual. 
Although programming manuals often suffer from poor writing, it's best to be prepared to 
dig into them. A well-honed ability to skim is a great asset. The Perl manual isn't bad; its 
main problem is that, as with most manuals, all the details are there, so it can be a bit 
overwhelming at first. However, the Perl documentation does a decent job of helping the 
beginner navigate, by means of tutorial documents. 


Finally, I urge you, the beginning programmer, to find some experienced Perl 
programmer who can answer the occasional question. This may be your teacher or 
teaching assistant in a course, a coworker, someone down at the local computer store, or 
someone replying to your posting on an online newsgroup (there are newsgroups 
specifically for Perl beginners). Chances are that an occasional conversation with an 
experienced user can save you many hours of chasing dead ends during your initial 
learning stages. Many programmers are happy to lend a hand or offer advice to beginners; 
there's a friendly and collegial atmosphere that prevails in the programming community. 


Be warned, however: experts can become irritated at people who continually pose 
questions whose answers are readily available in FAQs and other standard documentation. 
You might sometimes see the advice to RTFM—acronym for Read The F(ine) Manual— 
in response to such questions. So do a little checking around in the FAQs before 
repeatedly asking for someone's valuable time. 


(I can't resist the occasional anecdote.) At my first programming job, which I took to 
learn programming, I was stumped by a problem for which there seemed to be no obvious 
solution. I approached the person who had been cited as the best programmer in the 
laboratory. I carefully explained my predicament as he patiently listened. When I was 
done, he smiled and advised, "Be a man. Do it yourself." I was crestfallen and retired in 
confusion. But as it turned out, his advice was given with tongue in cheek, and he later 
approached me and gave me pointers that led to a solution. 


Chapter 3. The Art of Programming 


This chapter provides an overview of how programmers accomplish their jobs. If you 
already have Perl installed, and you want to get started writing programs for 
bioinformatics, feel free to skip ahead to Chapter 4. 


Just as visitors to a biology lab tend to have a clueless awe of "all those test tubes," so the 
newcomer to programming may regard the world of the programmer as a kind of arcane 
black box full of weird terminology and abstruse skills. So, to make the whole enterprise 
a little more congenial, let's take a short tour of some important realities that affect all 
programmers. Two of the most important are practical strategies that good programmers 
use and where to go to find answers to questions that arise while you are programming. 
Using a couple of brief narrative case studies, we'll look at how programmers find 
solutions to problems. Appendix A lists some of the best Perl and bioinformatics 
resources to help you solve your particular problems. 


3.1 Individual Approaches to Programming 


What's the best way to learn programming? The answer depends on what you hope to 
accomplish. There are several ways to get started. You can: 


Take classes of many different kinds 

Read a tutorial book like this one 

Get the programming manuals and plunge in 

Be tutored by a programmer 

Identify a program you need 

Try any and all of the above until you've managed to write the program 

The answer also depends on how you choose to learn. Some people prefer classes, 
because the information is often presented in a well-organized way, and questions can be 
answered by the teacher. Others learn best with self-paced study. 

Some things about learning to program are common to all these approaches. If you've 
never programmed at all, the information in the following sections is a "heads-up" about 
what's ahead. 


3.2 Edit—Run—Revise (and Save) 


The most important thing about programming is that it's a hands-on learning activity, like 
dancing, playing music, cooking, or some other family-oriented activity. You can read 
about it, but you can't actually do it until you actually do it. 


While learning to program in Perl, you need to read about how Perl works, as you will in 
the chapters that follow. You also need to look at plenty of examples of programs. But 
you especially need to attempt to write your own programs, as you are asked to do in the 
exercises at the end of the later chapters. Only this kind of direct experience will make 
you a programmer. 


So I want to give you an overview of the most important tasks involved in writing 
programs, to help you approach your first programs with a clearer idea of what's really 
involved. 


What exactly will you be doing at the computer? The bulk of a programmer's work 
involves the steps of writing or revising a program in an editor, then running the program 
and watching how it behaves, and on the basis of that behavior going back and revising 
the program again. A typical programmer spends more than half of his or her time editing 
the program. 


3.2.1 Saves and Backups 


Once you have even a few lines of code written, it's important to save it. In fact, you 
should always remember to save a version of your program at regular intervals during 
editing, so if you make a bunch of edits and the computer crashes, you don't lose hours of 
work. Also, make sure you back up your work on another disk. Hard disks fail, and when 
yours does, the information on it will be lost. Therefore it's essential to make regular 
(daily) backups of your work onto some other medium—tape, floppy disk, Zip disk, 
another hard disk, writable CD—whatever, just so you won't lose all your work if a disk 
failure occurs. 


In addition to backups of your disks, it's also a good idea to save a dated version of your 
program at regular intervals. This will allow you to go back to an earlier version of your 
program should that prove necessary. 
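
Saving a dated version is easy to automate. The sketch below copies each file named on the command line to a new file whose name ends in the current date. It uses the standard File::Copy and POSIX modules; the helper name dated_name and the name.YYYY-MM-DD naming scheme are assumptions made for this illustration, so feel free to adopt a different convention.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(copy);
use POSIX qw(strftime);

# dated_name is a hypothetical helper: given a filename and a time list in
# the format returned by localtime, it builds a name such as
# my_program.2001-10-15.
sub dated_name {
    my ($file, @time) = @_;
    return $file . '.' . strftime('%Y-%m-%d', @time);
}

# Copy each file named on the command line to a dated version beside it.
foreach my $file (@ARGV) {
    my $copy = dated_name($file, localtime);
    copy($file, $copy) or die "Cannot copy $file to $copy: $!";
    print "Saved $file as $copy\n";
}
```

Because the date is part of the name, running the program twice on the same day overwrites that day's copy rather than creating a second one.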


It's also a good idea to make sure the backups you're making actually work. So, for 
instance, if you're backing up to a tape drive, try restoring the files from your tape drive 
every once in a while, just to make sure that the software and the tapes themselves are all 
working. You may also want to print out your programs ("make a hardcopy") at 
regular intervals for extra insurance against system failures. Finally, it's good policy to 
keep the backups somewhere away from the computer, so in case of fire or other disaster, 
the backups will be safe. 


3.2.2 Error Messages 


Fixing errors is an essential step in writing programs. After you've written and edited a 
program, the next step is to run it to see if it works. Very often, you'll find that you've 
made some typographical error, like forgetting to put in a semicolon. As a result, your 
program isn't valid, and you'll get various error messages from the system. You then have 
to read the error messages and reedit your program to repair the offending code. 


These error messages are sometimes rather cryptic. In the event of an error, the Perl 
interpreter may have some trouble knowing exactly where you went wrong. It may only 
recognize that there is something wrong. So it guesses where the problem is, and in the 
process, it may give you some extraneous information. 


The most important thing about using error messages is to look at the first one or two 
error messages and ignore the rest; fix the top problems, and try running the program 
again. Error messages are often verbose and can run on for several pages. Just ignore 
everything but the first errors reported. Another important point is that the line numbers 
reported in those first error messages are usually right. Sometimes they're off by a line, 
and they're rarely way off. Later on, we'll practice generating and reading error messages. 
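
To see what such a message looks like without breaking one of your own programs, you can hand Perl a deliberately faulty line and capture its complaint. This is just a sketch: the helper name compile_errors is hypothetical, and the exact wording of the message varies between Perl versions, so I don't reproduce it here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A deliberately broken one-line program: the semicolon after 'ACGT' is missing.
my $broken = q{my $dna = 'ACGT' print $dna;};

# compile_errors is a hypothetical helper: it asks Perl to compile and run
# the code held in a string, and returns whatever error text Perl reports
# (an empty string if there was no error).
sub compile_errors {
    my ($code) = @_;
    eval $code;
    return $@;
}

# Following the advice above, show only the first line of the complaint.
my ($first_line) = split /\n/, compile_errors($broken);
print "First error reported: $first_line\n";
```

Try adding the missing semicolon to the string and running the program again; the complaint should change, because the broken line now compiles.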


3.2.3 Debugging 


Perhaps your edits created a valid program, and the Perl interpreter reads in your program 
and runs it. You find, however, that the program isn't doing what you want it to do. Now 
you have to go back, look at the program, and try to figure out what's wrong. 


Perhaps you made a simple mistake, such as adding instead of subtracting. You may have 
misread the documentation, and you're using the language the wrong way (reread the 
documentation). You may simply have an inadequate plan for accomplishing your goal 
(rethink your strategy and reprogram that part of the code). Sometimes you can't see 
what's wrong, and you have to look elsewhere (try searching newsgroup archives or 
FAQs or asking colleagues for help). 


For errors that are difficult to find, there are programs called debuggers that allow you to 
run the program step by step, looking at everything that's happening in the program. 
(Chapter 6 takes an in-depth look at Perl's debugger.) 


There are other tools and techniques you can use. For instance, you can examine your 
program by adding print statements that print out intermediate values or results. There 
are also special helper programs that can observe your program while it's running and 
then report about it, telling you, for instance, about where the program is spending most 
of its time. These tools, and others like them, are essential to programming, and you need 
to learn how to use them. 
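

As a minimal sketch of the print-statement technique, here is a small subroutine that computes the fraction of G and C bases in a DNA string and, when asked, prints its intermediate values. The names gc_fraction and $DEBUG are invented for this illustration; they aren't part of Perl.

```perl
use strict;
use warnings;

# Hypothetical flag for this sketch: set to 1 to see intermediate values.
my $DEBUG = 0;

# Return the fraction of G and C bases in a DNA string.
sub gc_fraction {
    my ($dna) = @_;
    my $length = length $dna;
    return 0 if $length == 0;

    # tr/// with an empty replacement list counts the matching characters
    my $gc = ($dna =~ tr/GCgc//);

    # The debugging print: shows the values the final answer is built from
    print STDERR "DEBUG: length=$length gc=$gc\n" if $DEBUG;

    return $gc / $length;
}

print gc_fraction('ACGGGAGGAC'), "\n";
```

If the answer looks wrong, setting $DEBUG to 1 shows whether the length or the count is the culprit, without changing what the program computes.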


3.3 An Environment of Programs 


Programming is an exercise in problem solving. It's an iterative, gradual process. 
Although it can be done by one person alone, it's often a social activity (this surprises 
many newcomers). It requires developing specific problem-solving skills and learning a 
few tools. Programming is sometimes tricky and can be frustrating. On the other hand, 
for those with an aptitude, there's a great sense of satisfaction that comes from building a 
working program. 


Computer programs can be many things, from barely useful, to aesthetically and 
intellectually stimulating, to important generators of new knowledge. They can be 
beautiful. (They can also be destructive, stupid, silly, or vicious; they are human creations, 
after all.) Because writing a program is an iterative, building, gradual process, there can 
be real satisfaction in seeing the work unfold from simple beginnings to complete 
structures. For the beginning student, this gradual unfolding of a new program mirrors the 
gradual mastery of the language. 


As our culture began writing and accumulating programs in the middle of the 20th 
century, a programming environment began to develop. Gradually, we've been 
accumulating a substantial body of procedural knowledge. Programs often reflect the fact 
that they swim in waters populated by many other programs, and beginning programmers 
can expect to learn a lot from this environment. 


3.3.1 Open Source Programs 


As programming has become important in the world, it has also become economically 
valuable. As a result, the source code for many programs is kept hidden to protect 
commercial assets and stymie the competition. 


However, the source code for many of the best and most used programs is freely 
available for anyone to examine. Freely available source code is called open source. 
(There are various kinds of copyrights that may attach to open source program code, but 
they all allow anyone to examine the source code.) The open source movement treats 
program source code in a similar manner to the way scientists publish their results: 
publicly and open to unfettered examination and discussion. 


The source code for these programs can be a wonderful place for the beginning 
programmer to learn how professional programmers write. The programs available in 
open source include the Perl interpreter and a large amount of Perl code, the Linux 
operating system, the Apache web server, the Netscape web browser, the sendmail mail 
transfer agent, and much more. 


3.4 Programming Strategies 


In order to give you, the beginning programmer, an idea of how programming is done, 
let's see how an experienced programmer goes about solving problems by giving a couple 
of instructive case studies. 


Imagine that you want to count all the regulatory elements[1] in a large chunk of DNA that 
you just got from the sequencing lab. You're a professional bioinformatics programmer. 
What do you do? There are two possible solutions: find a program or write one yourself. 


[1] A regulatory element is a stretch of DNA used by the cell in the control of a coding region, 
helping to determine if and when it's used to create a protein. 


It's likely there is already a perfectly good, working, and maybe even free program that 
does exactly what you need. Very often, you can find exactly what you need on the Web 
and avoid the expense of reinventing the wheel. This is programming at its 
best: minimal work for maximal effect. It's the classic case of the experimentalist's 
adage: a day in the library can save you six months in the lab. 


An important part of the art of programming is to keep aware of collections of programs 
that are available. Then you can simply use the code if it does exactly what you need, or 
you can take an existing program and alter it to suit your own needs. Of course, copyright 
laws must be observed, but much is available at no cost, especially to educational and 
nonprofit organizations. Most Perl module code has a copyright, but you are allowed to 
use it and modify it given certain restrictions. Details are available at the Perl web site 
and with the particular modules. 


How do you find this wonderful, free, and already existing program? The Perl 
community has an organized collection of such programming code at the Comprehensive 
Perl Archive Network (CPAN) web site, http://www.CPAN.org. Try exploring: 
you'll find it's organized by topic, so it's possible to quickly find, for example, web, 
statistics, or graphics programs. In our case, you will find the Bioperl module, which 
includes several useful bioinformatics functions. A module is a collection of Perl code 
that can be easily loaded and used by your Perl programs. 
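

Loading a module takes a single line. The sketch below uses List::Util, which comes with Perl itself, rather than Bioperl, which has to be installed separately; the fragment lengths are made-up illustration data.

```perl
use strict;
use warnings;

# List::Util is part of the standard Perl distribution;
# this line loads it and imports two of its functions.
use List::Util qw(sum max);

# Made-up lengths of some DNA fragments, for illustration only
my @lengths = (120, 45, 300);

print "Total length:   ", sum(@lengths), "\n";
print "Longest length: ", max(@lengths), "\n";
```

The functions sum and max were written and tested by someone else; the program only has to ask for them by name.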


The most useful kinds of code are convenient libraries or modules that package a suite of 
functions. These packages offer a great deal of flexibility in creating new programs. 
Although you still have to program, the job may be only a small fraction of the work of 
writing the whole program from scratch. For instance, to continue our example of looking 
for regulatory elements, your search may turn up a convenient module that lists the 
regulatory elements plus code that takes a list of elements and searches for them in a 
DNA library. Then all you have to do is combine the existing code, provide the DNA 
library, and with a little bit of programming, you're done. 


There are lots of other places to look for already existing code. You can search the 
Internet with your favorite search engines. You can browse collections of links for 
bioinformatics, looking for programs. You can also search the other sources we've 
already covered, such as newsgroups, relevant experts, etc. 


If you haven't hit paydirt yet, and you know that the program will take a significant 
amount of time to write yourself, you may want to search the literature in the library, and 
perhaps enlist the aid of a librarian. You can search Medline for articles about regulatory 
elements, since often an article will advertise code (an actual program in a language like 
Perl) that the authors will forward. You can consult conference proceedings, books, and 
journals. Conferences and trade shows are also great places to look around, meet people, 
and ask questions. 


In many cases you succeed, and despite the effort involved, you save yourself and your 
laboratory days, weeks, or months of effort. 


However, one big warning about modifying existing code: depending on how much 
alteration is required, it can sometimes be more difficult to modify existing code than to 
write a whole program from scratch. Why? Well, depending on who wrote the program, 
it may be difficult just to see what the different parts of the code do. You can't make 
modifications if you can't understand what methods the program uses in the first place. 
(We'll talk more about writing readable code, and the importance of comments in code, 
later.) This factor alone accounts for a large part of the expense of programming; many 
programs can't be easily read, or understood, so they can't be maintained. Also, testing 
the program may be difficult for various reasons, and it may take a lot of time and effort 
to assure yourself that your modifications are working correctly. 


Okay, let's say that you spent three days looking for an existing program, and there really 
wasn't anything available. (Well, there was one program, but it cost $30,000, which is 
way outside your budget, and your local programming expert was too busy to write one 
for you.) So you absolutely have to write the program yourself. 


How do you start from scratch and come up with a program that counts the regulatory 
elements in some DNA? Read on. 


3.5 The Programming Process 


You've been assigned to write a program that counts the regulatory elements in DNA. If 
you've never programmed, you probably have no idea of how to start. Let's talk about 
what you need to know to write the program. 


Here's a summary of the steps we'll cover: 


Identify the required inputs, such as data or information given by the user. 

Make an overall design for the program, including the general method (the algorithm) 
by which the program computes the output. 

Decide how the output will be presented; for example, written to files or displayed graphically. 

Refine the overall design by specifying more detail. 

Write the Perl program code. 


These steps may be different for shorter or longer programs, but this is the general 
approach you will take for most of your programming. 


3.5.1 The Design Phase 


First, you need to conceive a plan for how the program is going to work. This is the 
overall design of the program and an important step that's usually done before the actual 
writing of the program begins. Programs are often compared to kitchen recipes, in that 
they are specific instructions on how to accomplish some task. For instance, you need an 
idea of what inputs and outputs the program will have. In our example, the input would 
be the new DNA. You then need a strategy for how the program will do the necessary 
computing to calculate the desired output from the input. 


In our example, the program first needs to collect information from the user: namely, 
where is the DNA? (This information can be the name of a file that contains the computer 
representation of the DNA sequence.) The program needs to allow the user to type in the 
name of a datafile, maybe from the computer screen or from a web page. Then the 
program has to check if the file exists (and complain if not, as might happen, for instance, 
if the user misspelled the name) and finally open the file and read in the DNA before 
continuing. 
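

The input step just described can be sketched as follows. This is only one way to do it; the subroutine name read_dna is invented here, and the sketch assumes the file holds nothing but the sequence itself (no header lines).

```perl
use strict;
use warnings;

# Read a DNA sequence from a named file and return it as one string.
# Complains and stops if the file is missing or cannot be opened.
sub read_dna {
    my ($filename) = @_;

    # Check that the file exists (the user may have misspelled the name)
    die "Cannot find file '$filename'\n" unless -e $filename;

    open(my $fh, '<', $filename)
        or die "Cannot open file '$filename': $!\n";
    my @lines = <$fh>;
    close $fh;

    # Join the lines into one string and remove all whitespace,
    # including the newline at the end of each line
    my $dna = join('', @lines);
    $dna =~ s/\s//g;
    return $dna;
}

# Typical use (commented out so the sketch runs without waiting for input):
# print "Please type the filename of the DNA sequence data: ";
# my $filename = <STDIN>;
# chomp $filename;
# my $dna = read_dna($filename);
```

Keeping the file handling in its own subroutine means the rest of the program never has to care where the DNA came from.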


This simple step deserves some comment. You can put the DNA directly into the 
program code and avoid having to write this whole part of the program. But by designing 
the program to read in the DNA, it's more useful, because you won't have to rewrite the 
program every time you get some new DNA. It's a simple, even obvious idea, but very 
powerful. 


The data your program uses to compute is called the input. Input can come from files, 
from other programs, from users running the program, from forms filled out on web sites, 
from email messages, and so forth. Most programs read in some form of input; some 
programs don't. 


Let's add the list of regulatory elements to the actual program code. You can ask for a file 
that contains this list, as we did with the DNA, and have the program be capable of 
searching different lists of regulatory elements. However, in this case, the list you will 
use isn't going to change, so why bother the user with inputting the name of another file? 


Now that we have the DNA and the list of regulatory elements, you have to decide in 
general terms how the program is actually going to search for each regulatory element in 
the DNA. This step is obviously the critical one, so make sure you get it right. For 
instance, you want the program to run quickly enough, if the speed of the program is an 
important consideration. 


This is the problem of choosing the correct algorithm for the job. An algorithm is a 
design for computing a problem (I'll say more about it in a minute). For instance, you 
may decide to take each regulatory element in turn and search through the DNA from 
beginning to end for that element before going on to the next one. Or perhaps you may 
decide to go through the DNA only once, and at each position check each of the 
regulatory elements to see if it is present. Is there any advantage to one way or the 
other? Can you sort the list of regulatory elements so your search can proceed more 
quickly? For now, let's just say that your choice of algorithm is important. 
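

To make the two strategies concrete, here is a rough sketch of each. The subroutine names, the DNA, and the list of elements are all made up for illustration, and both versions assume that no element is an empty string.

```perl
use strict;
use warnings;

# Strategy 1: take each element in turn and scan the whole DNA for it.
sub count_by_element {
    my ($dna, @elements) = @_;
    my $count = 0;
    foreach my $element (@elements) {
        # index returns the position of the next match, or -1 if none
        my $pos = index($dna, $element);
        while ($pos != -1) {
            $count++;
            $pos = index($dna, $element, $pos + 1);
        }
    }
    return $count;
}

# Strategy 2: go through the DNA once, testing every element at each position.
sub count_by_position {
    my ($dna, @elements) = @_;
    my $count = 0;
    for my $pos (0 .. length($dna) - 1) {
        foreach my $element (@elements) {
            $count++ if substr($dna, $pos, length $element) eq $element;
        }
    }
    return $count;
}

# Made-up DNA and elements, for illustration only
my $dna      = 'TATAAGGCCAATCTTATAA';
my @elements = ('TATAA', 'CCAAT');

print count_by_element($dna, @elements), "\n";
print count_by_position($dna, @elements), "\n";
```

Both subroutines print 3 for this data: the same answer, reached by visiting the DNA in a different order. Which one runs faster on a large sequence is exactly the kind of question the study of algorithms tries to answer.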


The final part of the design is to provide some form of output for the results. Perhaps you 
want the results displayed on a web page, as a simple list on the computer screen, in a 
printable file, or perhaps all of the above. At this stage, you may need to ask the user for a 
filename to save the output. 


This brings up the problem of how to display results. This question is actually a critically 
important one. The ideal solution is to display the results in a way that shows the user at a 
glance the salient features of the computation. You can use graphics, color, maps, little 
bouncing balls over the unexpected result: there are many options. A program that 
outputs results that are hard to read is clearly not doing a good job. In fact, output that 
makes the salient results hard to find or understand can completely negate all the effort 
you put into writing an elegant program. Enough said for now. 
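

Even plain text output can be made easy to scan. The following sketch, with a made-up subroutine name and made-up counts, lines the results up in columns and puts the most frequent element first.

```perl
use strict;
use warnings;

# Build a plain-text report, one element per line, most frequent first.
sub format_report {
    my (%counts) = @_;

    # %-10s pads the name to 10 columns; %5s and %5d right-align in 5
    my $report = sprintf("%-10s %5s\n", 'Element', 'Count');
    foreach my $element (sort { $counts{$b} <=> $counts{$a} or $a cmp $b } keys %counts) {
        $report .= sprintf("%-10s %5d\n", $element, $counts{$element});
    }
    return $report;
}

# Made-up counts, for illustration only
my %counts = (TATAA => 2, CCAAT => 1, GGGCGG => 0);

print format_report(%counts);
```

Because the subroutine returns the report as a string, the same text can be printed to the screen, saved in a file, or both.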


There are several strategies employed by programmers to help create good overall 
designs. Usually, any program but the smallest is written in several small but 
interconnecting parts. (We'll see lots of this as we proceed in later chapters.) What will 
the parts be, and how will they interconnect? The field of software engineering addresses 
these kinds of issues. At this point I only want to point out that they are very important 
and mention some of the ways programmers address the need for design. 


There are many design methodologies; each has its dedicated adherents. The best 
approach is to learn what is available and use the best methodology for the job at hand. 
For instance, in this book I'm teaching a style of programming called imperative 
programming, relying on dividing a problem into interacting procedures or 
subroutines (see Chapter 6), known as structured design. Another popular style 
is called object-oriented programming, which is also supported by Perl. 


If you're working in a large group of programmers on a big project, the design phase can 
be very formal and may even be done by different people than the programmers 
themselves. On the other end of the scale, you will find solitary programmers who just 
start writing, developing a plan as they write the code. There is no one best way that 
works for everyone. But no matter how you approach it, as a beginner you still need to 
have some sort of design in mind before you start writing code. 


3.5.2 Algorithms 


An algorithm is the design, or plan, for the computation done by a computer program. 
(It's actually a tricky term to define, outside of a formal mathematical system, but this is a 
reasonable definition.) An algorithm is implemented by coding it in a specific computer 
language, but the algorithm is the idea of the computation. It's often well represented in 
pseudocode, which gives the idea of a program without actually being a real computer 
program. 


Most programs do simple things. They get filenames from users, open the files, and read 
in the data. They perform simple calculations and display the results. These are the types 
of algorithms you'll learn here. 


However, the science of algorithms is a deep and fruitful one, with many important 
implications for bioinformatics. Algorithms can be designed to find new ways of 
analyzing biological data and of discovering new scientific results. There are certainly 
many problems in biology whose solutions could be, and will be, substantially advanced 
by inventing new algorithms. 


The science of algorithms includes many clever techniques. As a beginning programmer, 
you needn't worry about them just yet. At this stage, an introductory chapter in a 
beginning tutorial on programming, it's not reasonable to go into details about 
algorithmic methods. Your first task is just to learn how to write in some programming 
language. But if you keep at it, you'll start to learn the techniques. A decent textbook to 
keep around as a reference is a good investment for a serious programmer (see 
Appendix A). 


In the current example that counts regulatory elements in DNA, I suggest a way of 
proceeding. Take each regulatory element in turn, and search through the DNA for it, 
before proceeding to the next regulatory element. Other algorithms are also possible; in 
fact, this is one example of the general problem called string matching, which is one 
of the most important for bioinformatics, and the study of which has resulted in a variety 
of clever algorithms. 


Algorithms are usually grouped by such problems or by technique, and there is a wealth 
of material available. For the practical programmer, some of the most valuable materials 
are collections of algorithms written in specific languages, that can be incorporated into 
your programs. Use Appendix A as a starting place. Using the collections of code and 
books given there, it's possible to incorporate many algorithmic techniques in your Perl 
code with relative ease. 


3.5.3 Pseudocode and Code 


Now you have an overall design, including input, algorithm, and output. How do you 
actually turn this general idea into a design for a program? 


A common implementation strategy is to begin by writing what is called pseudocode. 
Pseudocode is an informal program, in which details are left out and formal syntax isn't 
followed.[2] It doesn't actually run as a program; its purpose is to flesh out an idea of the 
overall design of a program in a quick and informal way. 


[2] Syntax refers to the rules of grammar. English syntax decrees, "Go to school" not "School 
go to." Programming languages also have syntax rules. 


For example, in an actual Perl program you might write a bit of code called a subroutine 
(see Chapter 6), in this case, a subroutine that gets an answer from a user typing at the 
keyboard. Such a subroutine may look like this: 
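sub getanswer { 
    # Prompt the user and read one line typed at the keyboard 
    print "Type in your answer here: "; 
    my $answer = <STDIN>; 

    # Remove the trailing newline before returning the answer 
    chomp $answer; 
    return $answer; 
} 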
But in pseudocode, you might just say: 
getanswer 


and worry about the details later. 


Here's an example of pseudocode for the program I've been discussing: 


get the name of DNAfile from the user 
read in the DNA from the DNAfile 

for each regulatory element 
    if element is in DNA, then 
        add one to the count 

print count 


3.5.4 Comments 


Comments are parts of Perl source code that are used as an aid to understanding what the 
program does. Anything from a # sign to the end of a line is considered a comment and is 
ignored by the Perl interpreter. (The exception is the first line of many Perl programs, 
which looks something like this: #!/usr/bin/perl; see Section 4.2.3 in Chapter 4.) 


Comments are of considerable importance in keeping code useful. They typically include 
a discussion of the overall purpose and design of the program, examples of how to use 
the program, and detailed notes interspersed throughout the code explaining why that 
code is there and what it does. In general, a good programmer writes good comments as 
an integral part of the program. You'll see comments in all the programming examples in 
this book. 


This is important: your code has to be readable by humans as well as computers. 


Comments can also be useful when debugging misbehaving programs. If you're having 
trouble figuring out where a program is going wrong, you can try to selectively comment 
out different parts of the code. If you find a section that, when commented out, removes 
the problem, you can then narrow down the part you've commented out until you have a 
fairly short section of code in which you know where the problem is. This is often a 
useful debugging approach. 


Comments can be used when you turn pseudocode into Perl source code. Pseudocode is 
not Perl code, so the Perl interpreter will complain about any pseudocode that is not 
commented out. You can comment out the pseudocode by placing # signs at the 
beginning of all pseudocode lines: 

#get the name of DNAfile from the user 
#read in the DNA from the DNAfile 

#for each regulatory element 
#  if element is in DNA, then 
#    add one to the count 

#print count 

As you expand your pseudocode design into Perl code, you can uncomment the Perl code 
by removing the # signs. In this way you may have a mixture of Perl and pseudocode, but 
you can run and test the Perl parts; the Perl interpreter simply ignores commented-out 
lines. 
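

Here is what such a mixture might look like partway through. In this sketch the input steps are still pseudocode, so a made-up sequence stands in for the contents of the file; the list of regulatory elements is also invented for the example.

```perl
use strict;
use warnings;

#get the name of DNAfile from the user
#read in the DNA from the DNAfile
# (not written yet; a made-up sequence stands in for the file contents)
my $dna = 'TATAAGGCCAATCTTATAA';

# A made-up list of regulatory elements, for illustration only
my @elements = ('TATAA', 'CCAAT', 'GGGCGG');

#for each regulatory element
#  if element is in DNA, then
#    add one to the count
my $count = 0;
foreach my $element (@elements) {
    # index returns -1 when the element does not occur in the DNA
    if (index($dna, $element) != -1) {
        $count++;
    }
}

#print count
print "$count\n";
```

The program already runs and can be tested, even though the first two steps of the design are still waiting to be written.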


You can even leave the complete pseudocode design, commented out, intact in the 
program. This leaves an outline of the program's design that may come in handy when 
you or someone else tries to read or modify the code. 


We've now reached the point where we're ready for actual Perl programming. In 
Chapter 4 you will learn Perl syntax and begin programming in Perl. As you do, 
remember the initial phase of designing your program, followed by the cycle you will 
spend most of your time in: editing the program, running the program, and revising the 
program. 


Chapter 4. Sequences and Strings 


In this chapter you will begin to write Perl programs that manipulate biological sequence 
data, that is, DNA and proteins. Once you have the sequences in the computer, you'll start 
writing programs that do the following with the sequence data: 


Transcribe DNA to RNA 

Concatenate sequences 

Make the reverse complement of sequences 

Read sequence data from files 


You'll also write programs that give information about your sequences. How GC-rich is 
your DNA? How hydrophobic is your protein? You'll see programming techniques you 
can use to answer these and similar questions. 


The Perl skills you will learn in this chapter involve the basics of the language. Here are 
some of those basics: 


Scalar variables 

Array variables 

String operations such as substitution and translation 

Reading data from files 


4.1 Representing Sequence Data 


The majority of this book deals with manipulating symbols that represent the biological 
sequences of DNA and proteins. The symbols used in bioinformatics to represent these 
sequences are the same symbols biologists have been using in the literature for this same 
purpose. 


As stated earlier, DNA is composed of four building blocks: the nucleic acids, also called 
nucleotides or bases. Proteins are composed of 20 building blocks, the amino acids, also 
called residues. Fragments of proteins are called peptides. Both DNA and proteins are 
essentially polymers, made from their building blocks attached end to end. So it's 
possible to summarize the structure of a DNA molecule or protein by simply giving the 
sequence of bases or amino acids. 


These are brief definitions; I'm assuming you are either already familiar with them or are 
willing to consult an introductory textbook on molecular biology for more specific details. 
Table 4-1 shows bases; add a sugar and you get the nucleosides adenosine, guanosine, 
cytidine, thymidine, and uridine. You can further add a phosphate and get the nucleotides 
adenylic acid, guanylic acid, cytidylic acid, thymidylic acid, and uridylic acid. A nucleic 
acid is a chemically linked sequence of nucleotides. A peptide is a small number of 
joined amino acids; a longer chain is a polypeptide. A protein is a biologically functional 
unit made of one or more polypeptides. A residue is an amino acid in a polypeptide chain. 


For expediency, the names of the nucleic acids and the amino acids are often represented 
as one- or three-letter codes, as shown in Table 4-1 and Table 4-2. (This book mostly 
uses the one-letter codes for amino acids.) 


Table 4-1. Standard IUB/IUPAC nucleic acid codes 


Code    Nucleic acid(s) 
A       Adenine 
C       Cytosine 
G       Guanine 
T       Thymine 
U       Uracil 
M       A or C (amino) 
R       A or G (purine) 
W       A or T (weak) 
S       C or G (strong) 
Y       C or T (pyrimidine) 
K       G or T (keto) 
N       A or G or C or T (any) 





Table 4-2. Standard IUB/IUPAC amino acid codes 


One-letter code    Amino acid                     Three-letter code 
A                  Alanine                        Ala 
B                  Aspartic acid or Asparagine    Asx 
C                  Cysteine                       Cys 
D                  Aspartic acid                  Asp 
E                  Glutamic acid                  Glu 
F                  Phenylalanine                  Phe 
G                  Glycine                        Gly 
H                  Histidine                      His 
I                  Isoleucine                     Ile 
K                  Lysine                         Lys 
L                  Leucine                        Leu 
M                  Methionine                     Met 
N                  Asparagine                     Asn 
P                  Proline                        Pro 
Q                  Glutamine                      Gln 
R                  Arginine                       Arg 
S                  Serine                         Ser 
T                  Threonine                      Thr 
V                  Valine                         Val 
W                  Tryptophan                     Trp 
X                  Unknown                        Xxx 
Y                  Tyrosine                       Tyr 
Z                  Glutamic acid or Glutamine     Glx 
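

A table like this translates naturally into a Perl hash. As a small sketch (the hash and subroutine names are my own, and hashes aren't introduced until later in the book), here is code that spells out a peptide written in one-letter codes:

```perl
use strict;
use warnings;

# The one-letter and three-letter codes from Table 4-2
my %three_letter = (
    A => 'Ala', B => 'Asx', C => 'Cys', D => 'Asp', E => 'Glu',
    F => 'Phe', G => 'Gly', H => 'His', I => 'Ile', K => 'Lys',
    L => 'Leu', M => 'Met', N => 'Asn', P => 'Pro', Q => 'Gln',
    R => 'Arg', S => 'Ser', T => 'Thr', V => 'Val', W => 'Trp',
    X => 'Xxx', Y => 'Tyr', Z => 'Glx',
);

# Spell out a peptide given in one-letter codes.
# Any letter not in the table is reported as Xxx (unknown).
sub spell_out {
    my ($peptide) = @_;
    my @names;
    foreach my $code (split //, uc $peptide) {
        push @names, exists $three_letter{$code} ? $three_letter{$code} : 'Xxx';
    }
    return join '-', @names;
}

print spell_out('MKWG'), "\n";
```

For the input MKWG this prints Met-Lys-Trp-Gly.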





The nucleic acid codes in Table 4-1 include letters for the four basic nucleic acids; they 
also define single letters for all possible groups of two, three, or four nucleic acids. In 
most cases in this book, I use only A, C, G, T, U, and N. The letters A, C, G, and T 
represent the nucleic acids for DNA. U replaces T when DNA is transcribed into 
ribonucleic acid (RNA). N is the common representation for "unknown," as when a 
sequencer can't determine a base with certainty. Later on, in Chapter 9, we'll need the 
other codes, for groups of nucleic acids, when programming restriction maps. Note that 
the lowercase versions of these single-letter codes are also used on occasion, frequently for 
DNA, rarely for protein. 
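

To give a feel for how the group codes can be used in a program, here is a rough sketch that turns a pattern written with the codes of Table 4-1 into a Perl regular expression. The names %iupac and iupac_to_regex are invented for this example, the hash covers only the group codes listed in the table, and the pattern is assumed to contain nothing but letters.

```perl
use strict;
use warnings;

# The codes from Table 4-1 that stand for more than one base,
# written as regular-expression character classes.
my %iupac = (
    M => '[AC]',
    R => '[AG]',
    W => '[AT]',
    S => '[CG]',
    Y => '[CT]',
    K => '[GT]',
    N => '[ACGT]',
);

# Translate a pattern written with IUPAC codes into a regular expression.
# Letters that are not group codes (A, C, G, T) are copied unchanged.
sub iupac_to_regex {
    my ($pattern) = @_;
    my $regex = '';
    foreach my $code (split //, uc $pattern) {
        $regex .= exists $iupac{$code} ? $iupac{$code} : $code;
    }
    return $regex;
}

my $regex = iupac_to_regex('GRCGYC');
print "$regex\n";
print "match\n" if 'AAGACGTCTT' =~ /$regex/;
```

The pattern GRCGYC becomes G[AG]CG[CT]C, which matches the GACGTC in the made-up test sequence.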


The computer-science terminology is a little different from the biology terminology for 
the codes in Table 4-1 and Table 4-2. In computer-science parlance, these tables 
define two alphabets, finite sets of symbols that can make strings. A sequence of symbols 
is called a string. For instance, this sentence is a string. A language is a (finite or infinite) 
set of strings. In this book, the languages are mainly DNA and protein sequence data. 
You often hear bioinformaticians referring to an actual sequence of DNA or protein as a 
"string," as opposed to its representation as sequence data. This is an example of the 
terminologies of the two disciplines crossing over into one another. 
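

In program terms, asking whether a string is made only of symbols from a given alphabet is a one-line test. Here's a minimal sketch for the four-letter DNA alphabet; the subroutine name is_dna is my own.

```perl
use strict;
use warnings;

# Return 1 if the string is made only of letters from the DNA alphabet
# (A, C, G, T), and 0 otherwise. An empty string is rejected.
sub is_dna {
    my ($string) = @_;
    return $string =~ /\A[ACGT]+\z/ ? 1 : 0;
}

print is_dna('ACGGGAGGAC'), "\n";   # a string over the DNA alphabet
print is_dna('ACGU'), "\n";         # U is not in the DNA alphabet
```

Changing the letters inside the square brackets changes the alphabet; the test itself stays the same.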


As you've seen in the tables, we'll be representing data as simple letters, just as written on 
a page. But computers actually use additional codes to represent simple letters. You won't 
have to worry much about this; just remember, when using your text editor, to save your files as 
ASCII, or plain text. 


ASCII is a way for computers to store textual (and control) data in their memory. Then 
when a program such as a text editor reads the data, and it knows it's reading ASCII, it 
can actually draw the letters on the screen in a recognizable fashion because it's 
programmed to know that particular code. So the bottom line is: ASCII is a code to 
represent text on a computer.[1] 


[1] A new character encoding called Unicode, which can handle all the symbols in all the world's 
languages, is becoming widely accepted and is supported by Perl as well. 
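

If you're curious, Perl will show you the codes. This short sketch uses the built-in functions ord and chr to go from a letter to its numeric code and back.

```perl
use strict;
use warnings;

# ord gives the numeric code for a character; chr goes the other way.
foreach my $base ('A', 'C', 'G', 'T') {
    print "$base is stored as code ", ord($base), "\n";
}

print chr(65), "\n";   # code 65 is the letter A
```

You won't need these codes to work with sequence data; the point is only that each letter you type is stored as a number.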


4.2 A Program to Store a DNA Sequence 


Let's write a small program that stores some DNA in a variable and prints it to the screen. 
The DNA is written in the usual fashion, as a string made of the letters A, C, G, and T, 
and we'll call the variable $DNA. In other words, $DNA is the name of the DNA sequence 
data used in the program. Note that in Perl, a variable is really the name for some data 
you wish to use. The name gives you full access to the data. Example 4-1 shows the 
entire program. 


Example 4-1. Putting DNA into the computer 


#!/usr/bin/perl -w 
# Storing DNA in a variable, and printing it out 

# First we store the DNA in a variable called $DNA 
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; 

# Next, we print the DNA onto the screen 
print $DNA; 

# Finally, we'll specifically tell the program to exit. 
exit; 


Using what you've already learned about text editors and running Perl programs in 
Chapter 2, enter the code (or copy it from the book's web site) and save it to a file. 
Remember to save the program as ASCII or text-only format, or Perl may have trouble 
reading the resulting file. 


The second step is to run the program. The details of how to run a program depend on the 
type of computer you have (see Chapter 2). Let's say the program is on your computer 
in a file called example4-1. As you recall from Chapter 2, if you are running this 
program on Unix or Linux, you type the following in a shell window: 


perl example4-1 


On a Mac, open the file with the MacPerl application and save it as a droplet, then just 
double-click on the droplet. On Windows, type the following in an MS-DOS command 
window: 


perl example4-1 


If you've successfully run the program, you'll see the output printed on your computer 
screen. 


4.2.1 Control Flow 


Example 4-1 illustrates many of the ideas all our Perl programs will rely on. One of 
these ideas is control flow, or the order in which the statements in the program are 
executed by the computer. 


Every program starts at the first line and executes the statements one after the other until 
it reaches the end, unless it is explicitly told to do otherwise. Example 4-1 simply 
proceeds from top to bottom, with no detours. 


In later chapters, you'll learn how programs can control the flow of execution. 


4.2.2 Comments Revisited 


Now let's take a look at the parts of Example 4-1. You'll notice lots of blank lines. 
They're there to make the program easy for a human to read. Next, notice the comments 
that begin with the # sign. Remember from Chapter 3 that when Perl runs, it throws 
these away along with the blank lines. In fact, to Perl, the following is exactly the same 
program as Example 4-1: 

#!/usr/bin/perl -w 
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; print $DNA; 
exit; 

In Example 4-1, I've made liberal use of comments. Comments at the beginning of 
code can make it clear what the program is for and who wrote it, and can present other 
information that can be helpful when someone needs to understand the code. Comments 
also explain what each section of the code is for and sometimes give explanations on how 
the code achieves its goals. 


It's tempting to belabor the point about the importance of comments. Suffice it to say that 
in most university-level, computer-science class assignments, the program without 
comments typically gets a low or failing grade; also, the programmer on the job who 
doesn't comment code is liable to have a short and unsuccessful career. 


4.2.3 Command Interpretation 


Because it starts with a # sign, the first line of the program looks like a comment, but it 
doesn't seem like a very informative comment: 


#!/usr/bin/perl -w 


This is a special line called command interpretation that tells a computer running Unix 
or Linux that this is a Perl program. It may look slightly different on different 
computers. On some machines, it's also unnecessary because the computer recognizes 
Perl from other information. A Windows machine is usually configured to assume that 
any program ending in .pl is a Perl program. In Unix or Linux, a Windows command 
window, or a MacOS X shell, you can type perl my_program, and your Perl 
program my_program won't need the special line. However, it's commonly used, so 
we'll have it at the start of all our programs. 


Notice that the first line of code uses a flag -w. The "w" stands for warnings, and it
causes Perl to print warning messages about suspicious code, in addition to its usual
error messages. Very often the message suggests the line number where Perl thinks the
problem began. Sometimes the line number is wrong, but the problem is usually on or just
before the line the message suggests. Later in the book,
you'll also see the statement use warnings as an alternative to -w.
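
To get a feel for what warnings buy you, here is a small sketch that isn't part of Example 4-1. It assumes you save it as its own program; the misspelled name $DNAA is made up purely for illustration.

```perl
#!/usr/bin/perl -w
# A sketch of a typo that -w catches.
# $DNAA is a deliberate misspelling of $DNA (a hypothetical mistake).

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# The misspelled variable has never been given a value,
# so no sequence appears after the label
print "The DNA is: $DNAA\n";

# The correctly spelled variable prints the sequence
print "The DNA is: $DNA\n";
```

Run with -w, Perl should complain that the name DNAA is used only once and that an uninitialized value was used, and it reports the line number of the first print statement. Without -w, the first print statement silently prints an empty sequence, and you are left to find the typo yourself.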


4.2.4 Statements 


The next line of Example 4-1 stores the DNA in a variable: 

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';


This is a very common, very important thing to do in a computer language, so let's take a 
leisurely look at it. You'll see some basic features about Perl and about programming 
languages in general, so this is a good place to stop skimming and actually read. 


This line of code is called a statement. In Perl, statements end in a semicolon (;). The use 
of the semicolon is similar to the use of the period in the English language. 


To be more accurate, this line of code is an assignment statement. Its purpose in this
program is to store some DNA into a variable called $DNA. There are several
fundamental things happening here, as you will see in the next sections.


4.2.4.1 Variables


First, let's look at the variable $DNA. Its name is somewhat arbitrary. You can pick
another name for it, and the program behaves the same way. For instance, if you replace
the two lines:

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

print $DNA;

with these:

$A_poem_by_Seamus_Heaney = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

print $A_poem_by_Seamus_Heaney;


the program behaves in exactly the same way, printing out the DNA to the computer 
screen. The point is that the names of variables in a computer program are your choice. 
(Within certain restrictions: in Perl, a variable name must be composed from upper- or 
lowercase letters, digits, and the underscore _ character. Also the first character must not 
be a digit.) 


This is another important point along the same lines as the remarks I've already made 
about using blank lines and comments to make your code more easily read by humans. 
The computer attaches no meaning to the use of the variable name $DNA instead of 
$A_poem_by_Seamus_Heaney, but whoever reads the program certainly will. One
name makes perfect sense, clearly indicates what the variable is for in the program, and 
eases the chore of understanding the program. The other name makes it unclear what the 
program is doing or what the variable is for. Using well-chosen variable names is part of 
what's called self-documenting code. You'll still need comments, but perhaps not as many, 
if you pick your variable names well. 


You've noticed that the variable name $DNA starts with a dollar sign. In Perl this kind of
variable is called a scalar variable, which is a variable that holds a single item of data. 
Scalar variables are used for such data as strings or various kinds of numbers (e.g., the 
string hello or numbers such as 25, 6.234, 3.5E10, -0.8373). A scalar variable holds 
just one item of data at a time. 


4.2.4.2 Strings 


In Example 4-1, the scalar variable $DNA is holding some DNA, represented in the 
usual way by the letters A, C, G, and T. As stated earlier, in computer science a sequence 
of letters is called a string. In Perl you designate a string by putting it in quotes. You can 
use single quotes, as in Example 4-1, or double quotes. (You'll learn the difference 
later.) The DNA is thus represented by:

'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'


4.2.4.3 Assignment 


In Perl, to set a variable to a certain value, you use the = sign. The = sign is called the
assignment operator. In Example 4-1, the value:

'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'

is assigned to the variable $DNA. After the assignment, you can use the name of the
variable to get the value, as in the print statement in Example 4-1.


The order of the parts is important in an assignment statement. The value assigned to 
something appears to the right of the assignment operator. The variable that is assigned a 
value is always to the left of the assignment operator. In programming manuals, you 
sometimes come across the terms lvalue and rvalue to refer to the left and right sides of
the assignment operator. 


This use of the = sign has a long history in programming languages. However, it can be a 
source of confusion: for instance, in most mathematics, using = means that the two 
things on either side of the sign are equal. So it's important to note that in Perl, the = sign 
doesn't mean equality. It assigns a value to a variable. (Later, we'll see how to represent 
equality.) 
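
A short sketch may help make the difference from mathematical equality concrete. It isn't part of Example 4-1, and the variable name $copy is my own choice for illustration:

```perl
#!/usr/bin/perl -w
# Assignment copies a value; it does not declare two things equal forever

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# The value on the right is stored into the variable on the left
$copy = $DNA;

# Assigning a new value to $DNA replaces the old one but leaves $copy alone
$DNA = 'ATAGTGCCGTGAGAGTGATGTAGTA';

print $DNA;
print "\n";     # start a new line (newlines are explained in Section 4.3)
print $copy;
print "\n";
```

After the second assignment the two variables hold different strings, which would make no sense if = meant equality.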


So, to summarize what we've learned so far about this statement: 


$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

It's an assignment statement that sets the value of the scalar variable $DNA to a string
representing some DNA.


4.2.4.4 Print 


The statement: 


print $DNA;

prints ACGGGAGGACGGGAAAATTACTACGGCATTAGC out to the computer screen. 
Notice that the print statement deals with scalar variables by printing out their 
values—in this case, the string that the variable $DNA contains. You'll see more about 
printing later. 


4.2.4.5 Exit 


Finally, the statement exit; tells the computer to exit the program. Perl doesn't require 
an exit statement at the end of a program; once you get to the end, the program exits 
automatically. But it doesn't hurt to put one in, and it clearly indicates the program is over. 
You'll see other programs that exit if something goes wrong before the program normally 
finishes, so the exit statement is definitely useful. 


4.3 Concatenating DNA Fragments


Now we'll make a simple modification of Example 4-1 to show how to concatenate
two DNA fragments. Concatenation is attaching something to the end of something else.
A biologist is well aware that joining DNA sequences is a common task in the biology
lab, for instance when a clone is inserted into a cell vector or when exons are spliced
together during the expression of a gene. Many bioinformatics software packages have to
deal with such operations; hence its choice as an example.

Example 4-2 demonstrates a few more things to do with strings, variables, and print 
statements. 


Example 4-2. Concatenating DNA 


#!/usr/bin/perl -w
# Concatenating DNA

# Store two DNA fragments into two variables called $DNA1 and $DNA2
$DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
$DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';

# Print the DNA onto the screen
print "Here are the original two DNA fragments:\n\n";

print $DNA1, "\n";

print $DNA2, "\n\n";

# Concatenate the DNA fragments into a third variable and print them
# Using "string interpolation"
$DNA3 = "$DNA1$DNA2";

print "Here is the concatenation of the first two fragments (version 1):\n\n";

print "$DNA3\n\n";

# An alternative way using the "dot operator":
# Concatenate the DNA fragments into a third variable and print them
$DNA3 = $DNA1 . $DNA2;

print "Here is the concatenation of the first two fragments (version 2):\n\n";

print "$DNA3\n\n";

# Print the same thing without using the variable $DNA3
print "Here is the concatenation of the first two fragments (version 3):\n\n";

print $DNA1, $DNA2, "\n";

exit;


As you can see, there are three variables here, $DNA1, $DNA2, and $DNA3. I've added
print statements for a running commentary, so that the output of the program that 
appears on the computer screen makes more sense and isn't simply some DNA fragments 
one after the other. 


Here's what the output of Example 4-2 looks like: 





Here are the original two DNA fragments:

ACGGGAGGACGGGAAAATTACTACGGCATTAGC
ATAGTGCCGTGAGAGTGATGTAGTA

Here is the concatenation of the first two fragments (version 1):

ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA

Here is the concatenation of the first two fragments (version 2):

ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA

Here is the concatenation of the first two fragments (version 3):

ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA


Example 4-2 has many similarities to Example 4-1. Let's look at the differences. To
start with, the print statements have some extra, unintuitive parts:

print $DNA1, "\n";

print $DNA2, "\n\n";


The print statements have variables containing the DNA, as before, but now they also 
have a comma and then "\n" or "\n\n". These are instructions to print newlines. A 
newline is invisible on the page or screen, but it tells the computer to go on to the 
beginning of the next line for subsequent printing. One newline, "\n", simply positions 
you at the beginning of the next line. Two newlines, "\n\n", move you to the next line and
then position you at the beginning of the line after that, leaving a blank line in between.


Look at the code for Example 4-2 to make sure you see what these newline
directives do to the output. A blank line is a line with nothing printed on it. Depending on
your operating system, it may be just a newline character or a combination of carriage
return and linefeed (in which cases, it may also be called an empty line), or it may include
nonprinting whitespace characters such as spaces and tabs. Notice that the newlines are
enclosed in double quotes, which means they are parts of strings. (Here's one difference
between single and double quotes, as mentioned earlier: "\n" prints a newline; '\n'
prints \n as written.)
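
If you'd like to see that difference for yourself, here is a minimal sketch. It isn't part of Example 4-2; the variable names are mine, and it uses Perl's built-in length function, which reports how many characters a string contains.

```perl
#!/usr/bin/perl -w
# Comparing "\n" in double quotes with '\n' in single quotes

$double = "ACGT\n";    # four letters followed by one newline character
$single = 'ACGT\n';    # four letters followed by a backslash and the letter n

print "Double-quoted string has ", length($double), " characters\n";
print "Single-quoted string has ", length($single), " characters\n";

# The first print ends the line; the second shows \n literally
print $double;
print $single, "\n";
```

The double-quoted string is one character shorter, because the two characters \n inside double quotes stand for a single newline.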


Notice the comma in the print statement. A comma separates items in a list. The 
print statement prints all the items that are listed. Simple as that. 


Now let's look at the statement that concatenates the two DNA fragments $DNA1 and
$DNA2 into the variable $DNA3:

$DNA3 = "$DNA1$DNA2";


The assignment to $DNA3 is just a typical assignment as you saw in Example 4-1, a 
variable name followed by the = sign, followed by a value to be assigned. 


The value to the right of the assignment operator is a string enclosed in double quotes.
The double quotes allow the variables in the string to be replaced with their values. This
is called string interpolation.[2] So, in effect, the string here is just the DNA of variable
$DNA1, followed directly by the DNA of variable $DNA2. That concatenation of the two
DNA fragments is then assigned to variable $DNA3.


[2] There are occasions when you might add curly braces during string interpolation. The extra
curly braces make sure the variable names aren't confused with anything else in the double- 
quoted string. For example, if you had variable $prefix and tried to interpolate it into the 
string I am $prefixinterested, Perl might not recognize the variable, confusing it with a 
nonexistent variable $prefixinterested. But the string I am ${prefix}interested is 
unambiguous to Perl. 
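
The footnote's point about curly braces can be tried out directly. In this sketch, the value 'un' for $prefix and the names $sentence and $labeled are assumptions chosen for illustration:

```perl
#!/usr/bin/perl -w
# Curly braces tell Perl exactly where a variable name ends

$prefix = 'un';

# Without the braces, Perl would look for a variable called $prefixinterested
$sentence = "I am ${prefix}interested";
print "$sentence\n";

# The same trick marks where a DNA variable ends and more text begins;
# without the braces, Perl would look for a variable called $DNA1_fragment
$DNA1 = 'ACGGGAGG';
$labeled = "${DNA1}_fragment";
print "$labeled\n";
```

The braces matter whenever the text right after the variable could legally be part of a variable name, that is, a letter, a digit, or an underscore.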


After assigning the concatenated DNA to variable $DNA3, you print it out, followed by a 
blank line: 

print "$DNA3\n\n";


One of the Perl catch phrases is, "There's more than one way to do it." So, the next part of 
the program shows another way to concatenate two strings, using the dot operator. The 
dot operator, when placed between two strings, creates a single string that concatenates 
the two original strings. So the line: 


$DNA3 = $DNA1 . $DNA2;


illustrates the use of this operator.


An operator in a computer language takes some arguments (in this
case, the strings $DNA1 and $DNA2) and does something to them,
returning a value (in this case, the concatenated string placed in the
variable $DNA3). The most familiar operators from arithmetic (plus,
minus, multiply, and divide) are all operators that take two numbers
as arguments and return a number as a value.
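

To see the parallel between the dot operator and an arithmetic operator, here is a brief sketch. The shortened DNA fragments and the names $labeled and $total_length are my own, used only for illustration:

```perl
#!/usr/bin/perl -w
# The dot operator compared with an arithmetic operator

$DNA1 = 'ACGGGAGG';
$DNA2 = 'ATAGTGCC';

# Two string arguments in, one string value out
$DNA3 = $DNA1 . $DNA2;
print $DNA3, "\n";

# Operators can be chained; each dot joins what is on its two sides
$labeled = 'Fragments: ' . $DNA1 . ' + ' . $DNA2;
print $labeled, "\n";

# Two numeric arguments in, one numeric value out
$total_length = 8 + 8;
print $total_length, "\n";
```

Notice that the + inside the quotes is just a character in a string, while the + between the two numbers is an operator that does arithmetic.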





Finally, just to exercise the different parts of the language, let's accomplish the same 
concatenation using only the print statement: 

print $DNA1, $DNA2, "\n";

Here the print statement has three parts, separated by commas: the two DNA 
fragments in the two variables and a newline. You can achieve the same result with the 
following print statement: 

print "$DNA1$DNA2\n";


Maybe the Perl slogan should be, "There are more than two ways to do it." 


Before leaving this section, let's look ahead to other uses of Perl variables. You've seen 
the use of variables to hold strings of DNA sequence data. There are other types of data, 
and programming languages need variables for them, too. In Perl, a scalar variable such 
as $DNA can hold a string, an integer, a floating-point number (with a decimal point), a
boolean (true or false) value, and more. When it's required, Perl figures out what 
kind of data is in the variable. For now, try adding the following lines to Example 4-1 
or Example 4-2, storing a number in a scalar variable and printing it out: 

$number = 17;

print $number, "\n";
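

As a rough illustration of Perl figuring out what kind of data is in a variable, consider the following sketch. The variable names $text, $sum, and $joined are assumptions of mine and don't appear in the examples:

```perl
#!/usr/bin/perl -w
# Perl decides from the operator whether to treat a scalar as a number or a string

$number = 17;      # a number
$text   = '17';    # a string that happens to look like a number

# Used with +, both are treated as numbers
$sum = $number + $text;
print $sum, "\n";

# Used with the dot operator, both are treated as strings
$joined = $number . $text;
print $joined, "\n";
```

The first print statement shows the arithmetic sum, and the second shows the two values written one after the other, even though the same two variables are used both times.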


4.4 Transcription: DNA to RNA 


A large part of what you, the Perl bioinformatics programmer, will spend your time doing 
amounts to variations on the same theme as Examples 4-1 and 4-2. You'll get some data, 
be it DNA, proteins, GenBank entries, or what have you; you'll manipulate the data; and 
you'll print out some results. 


Example 4-3 is another program that manipulates DNA; it transcribes DNA to RNA.
In the cell, this transcription of DNA to RNA is the outcome of the workings of a delicate,
complex, and error-correcting molecular machinery.[3] Here it's a simple substitution.
When DNA is transcribed to RNA, all the T's are changed to U's, and that's all that our
program needs to know.[4]


[3] Briefly, the coding DNA strand is the reverse complement of the other strand, which is used
as a template to synthesize its reverse complement as RNA, with T's replaced by U's. With the
two reverse complements, this is the same as the coding strand with the T->U replacement.


[4] We're ignoring the mechanism of the splicing out of introns, obviously. The T stands for
thymine; the U stands for uracil.


Example 4-3. Transcribing DNA into RNA


#!/usr/bin/perl -w
# Transcribing DNA into RNA

# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";

print "$DNA\n\n";

# Transcribe the DNA to RNA by substituting all T's with U's.
$RNA = $DNA;

$RNA =~ s/T/U/g;

# Print the RNA onto the screen
print "Here is the result of transcribing the DNA to RNA:\n\n";

print "$RNA\n";

# Exit the program.
exit;

Here's the output of Example 4-3:

Here is the starting DNA:

ACGGGAGGACGGGAAAATTACTACGGCATTAGC

Here is the result of transcribing the DNA to RNA:

ACGGGAGGACGGGAAAAUUACUACGGCAUUAGC


This short program introduces an important part of Perl: the ability to easily manipulate 
text data such as a string of DNA. The manipulations can be of many different sorts: 
translation, reversal, substitution, deletions, reordering, and so on. This facility of Perl is 
one of the main reasons for its success in bioinformatics and among programmers in 
general. 


First, the program makes a copy of the DNA, placing it in a variable called $RNA:

$RNA = $DNA;

Note that after this statement is executed, there's a variable called $RNA that actually
contains DNA.[5] Remember this is perfectly legal (you can call variables anything you
like), but it is potentially confusing to have inaccurate variable names. Now in this case,
the copy is preceded with informative comments and followed immediately with a
statement that indeed causes the variable $RNA to contain RNA, so it's all right. Here's a
way to prevent $RNA from containing anything except RNA:

($RNA = $DNA) =~ s/T/U/g;


[5] Recall the discussion in Section 4.2.4.3 about the importance of the order of the parts in an
assignment statement. Here, the value of $DNA, that is, the DNA sequence data that has been
stored in the $DNA variable, is being assigned to the variable $RNA. If you had written $DNA =
$RNA;, the value of the $RNA variable (which is empty) would have been assigned to the $DNA
variable, in effect wiping out the DNA sequence data in that variable and leaving two empty
variables.


In Example 4-3, the transcription happens in this statement:

$RNA =~ s/T/U/g;


There are two new items in this statement: the binding operator (=~) and the substitute 
command s/T/U/g. 


The binding operator =~ is used, obviously enough, on variables containing strings; here 
the variable $RNA contains DNA sequence data. The binding operator means "apply the
operation on the right to the string in the variable on the left." 


The substitution operator, shown in Figure 4-1, requires a little more explanation. The
different parts of the command are separated (or delimited) by the forward slash. First,
the s indicates this is a substitution. After the first / comes a T, which represents the
element in the string that will be substituted. After the second / comes a U, which
represents the element that's going to replace the T. Finally, after the third / comes g.
This g stands for "global" and is one of several possible modifiers that can appear in this
part of the statement. Global means "make this substitution throughout the entire string,"
that is to say, everywhere possible in the string.


Figure 4-1. The substitution operator


Thus, the meaning of the statement is: "substitute U's for all T's in the string data stored in
the variable $RNA."


The substitution operator is an example of the use of regular expressions. Regular 
expressions are the key to text manipulation, one of the most powerful features of Perl as 
you'll see in later chapters. 
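

Before moving on, you might experiment with the g modifier. The following sketch isn't part of Example 4-3; the variables $first_only and $count are my own additions, and it relies on the fact that a substitution gives back the number of replacements it made:

```perl
#!/usr/bin/perl -w
# Trying out the substitution operator with and without the g modifier

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Without g, only the first T is replaced
$first_only = $DNA;
$first_only =~ s/T/U/;
print "First T only: $first_only\n";

# With g, every T is replaced
$RNA = $DNA;
$count = ($RNA =~ s/T/U/g);
print "Every T:      $RNA\n";

# The substitution also reports how many replacements it made
print "Number of substitutions: $count\n";
```

Comparing the two printed sequences shows why the transcription in Example 4-3 needs the g: without it, all but the first T would be left in the RNA.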


4.5 Using the Perl Documentation 


A Perl programmer's most important resource is the Perl documentation. It should be 
installed on your computer, and it may also be found on the Internet at the Perl site. The 
Perl documentation may come in slightly different forms on your computer system, but 
the web version is the same for everybody. That's the version I refer to in this book. See 
the references in Appendix A for more discussion about different sources of Perl
documentation. 


Just to try it out, let's look up the print operator. First, open your web browser, and go to
http://www.perl.com. Then click on the Documentation link. Select "Perl's Builtin
Functions" and then "Alphabetical Listing of Perl's Functions". You'll see a rather lengthy
alphabetical listing of Perl's functions. Once you've found this page, you may want to
bookmark it in your browser, as you may find yourself turning to it frequently. Now click
on print to view the print operator.


Check out the examples they give to see how the language feature is actually used. This 
is usually the quickest way to extract what you need to know. 


Once you've got the documentation on your screen, you may find that reading it answers 
some questions but raises others. The documentation tends to give the entire story in a 
concise form, and this can be daunting for beginners. For instance, the documentation for 
the print function starts out: "Prints a string or a comma-separated list of strings. Returns 
TRUE if successful." But then comes a bunch of gibberish (or so it seems at this point in 
your learning curve!) Filehandles? Output streams? List context? 


All this information is necessary in documentation; after all, you need to get the whole 
story somewhere! Usually you can ignore what doesn't make sense. 


The Perl documentation also includes several tutorials that can be a great help in learning 
Perl. They occasionally assume more than a beginner's knowledge about programming 
languages, but you may find them very useful. Exploring the documentation is a great 
way to get up to speed on the Perl language. 


4.6 Calculating the Reverse Complement in Perl 


As you recall from Chapter 1, a DNA polymer is composed of nucleotides. Given the 
close relationship between the two strands of DNA in a double helix, it turns out that it's 
pretty straightforward to write a program that, given one strand, prints out the other. Such 
a calculation is an important part of many bioinformatics applications. For instance, when
searching a database with some query DNA, it is common to automatically search for the
reverse complement of the query as well, since you may have in hand the opposite strand 
of some known gene. 


Without further ado, here's Example 4-4, which uses a few new Perl features. As you'll 
see, it first tries one method, which fails, and then tries another method, which succeeds. 


Example 4-4. Calculating the reverse complement of a strand of DNA 


#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA

# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";

print "$DNA\n\n";

# Calculate the reverse complement
# Warning: this attempt will fail!
#
# First, copy the DNA into new variable $revcom
# (short for REVerse COMplement)
# Notice that variable names can use lowercase letters like
# "revcom" as well as uppercase like "DNA". In fact,
# lowercase is more common.
#
# It doesn't matter if we first reverse the string and then
# do the complementation; or if we first do the complementation
# and then reverse the string. Same result each time.
# So when we make the copy we'll do the reverse in the same statement.
#

$revcom = reverse $DNA;

#
# Next substitute all bases by their complements,
# A->T, T->A, G->C, C->G
#

$revcom =~ s/A/T/g;
$revcom =~ s/T/A/g;
$revcom =~ s/G/C/g;
$revcom =~ s/C/G/g;

# Print the reverse complement DNA onto the screen
print "Here is the reverse complement DNA:\n\n";

print "$revcom\n";

#
# Oh-oh, that didn't work right!
# Our reverse complement should have all the bases in it, since the
# original DNA had all the bases--but ours only has A and G!
#
# Do you see why?
#
# The problem is that the first two substitute commands above change
# all the A's to T's (so there are no A's) and then all the
# T's to A's (so all the original A's and T's are all now A's).
# Same thing happens to the G's and C's all turning into G's.
#

print "\nThat was a bad algorithm, and the reverse complement was wrong!\n";
print "Try again ... \n\n";

# Make a new copy of the DNA (see why we saved the original?)
$revcom = reverse $DNA;

# See the text for a discussion of tr///
$revcom =~ tr/ACGTacgt/TGCAtgca/;

# Print the reverse complement DNA onto the screen
print "Here is the reverse complement DNA:\n\n";

print "$revcom\n";

print "\nThis time it worked!\n\n";

exit;

Here's what the output of Example 4-4 should look like on your screen: 


Here is the starting DNA:

ACGGGAGGACGGGAAAATTACTACGGCATTAGC

Here is the reverse complement DNA:

GGAAAAGGGGAAGAAAAAAAGGGGAGGAGGGGA

That was a bad algorithm, and the reverse complement was wrong!
Try again ...

Here is the reverse complement DNA:

GCTAATGCCGTAGTAATTTTCCCGTCCTCCCGT

This time it worked!


You can check if two strands of DNA are reverse complements of each other by reading 
one left to right, and the other right to left, that is, by starting at different ends. Then 
compare each pair of bases as you read the two strands: they should always be paired C 
to G and A to T.


Just by reading in a few characters from the starting DNA and the reverse complement 
DNA from the first attempt, you'll see that the first attempt at calculating the reverse
complement failed. It was a bad algorithm. 


This is a taste of what you'll sometimes experience as you program. You'll write a 
program to accomplish a job and then find it didn't work as you expected. In this case, we 
used parts of the language we already knew and tried to stretch them to handle a new 
problem. Only they weren't quite up to the job. What went wrong? 


You'll find that this kind of experience becomes familiar: you write some code, and it 
doesn't work! So you either fix the syntax (that's usually the easy part and can be done 
from the clues the error messages provide), or you think about the problem some more, 
find why the program failed, and then try to devise a new and successful way. Often this 
requires browsing the language documentation, looking for the details of how the 
language works and hoping to find a feature that fixes the problem. If it can be solved on 
a computer, you can solve it using Perl. The trick is, how exactly? 


In Example 4-4, the first attempt to calculate the reverse complement failed. Each base 
in the string was translated as a whole, using four substitutions in a global fashion. 
Another way is needed. You could march through the DNA left to right, look at each base
one at a time, make the change to the complement, and then look at the next base in the 
DNA, marching on to the end of the string. Then just reverse the string, and you're done. 
In fact, this is a perfectly good method, and it's not hard to do in Perl, although it requires 
some parts of the language not found until Chapter 5. 


However, in this case, the tr operator, which stands for transliterate or translation, is
exactly suited for this task. It looks like the substitute command, with the three forward
slashes separating the different parts.


tr does exactly what's needed; it translates a set of characters into new characters, all at
once. Figure 4-2 shows how it works: the set of characters to be translated is between
the first two forward slashes. The set of characters that replaces the originals is between
the second and third forward slashes. Each character in the first set is translated into the
character at the same position in the second set. For instance, in Example 4-4, C is the 
second character in the first set, so it's translated into the second character of the second 
set, namely, G. Finally, since DNA sequence data can use upper- or lowercase letters 
(even though in this program the DNA is in uppercase only), both cases are included in 
the tr statement in Example 4-4. 


Figure 4-2. The tr statement

$revcom =~ tr/ACGT/TGCA/;

[Figure: each base in the first set maps to the base at the same position in the second
set: A to T, C to G, G to C, and T to A.]


The reverse function also does exactly what's needed, with a minimum of fuss. It
reverses the order of the elements in a list or, as in Example 4-4, the characters in a string.
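

One way to convince yourself that reverse and tr are doing the right thing is to take the reverse complement twice, which should give back the strand you started with. The sketch below is my own addition to Example 4-4; the names $roundtrip and $translated are hypothetical, and the last step relies on the fact that tr gives back the number of characters it translated.

```perl
#!/usr/bin/perl -w
# A sketch: taking the reverse complement twice gets back the original strand

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# First reverse complement
$revcom = reverse $DNA;
$revcom =~ tr/ACGTacgt/TGCAtgca/;

# Reverse complement of the reverse complement
$roundtrip = reverse $revcom;
$roundtrip =~ tr/ACGTacgt/TGCAtgca/;

print "Original:           $DNA\n";
print "Reverse complement: $revcom\n";
print "Round trip:         $roundtrip\n";

# Translating each base to itself changes nothing, but tr still
# counts the characters it handled, here every base in the sequence
$translated = ($revcom =~ tr/ACGTacgt/ACGTacgt/);
print "Bases seen by tr:   $translated\n";
```

If the first and third lines of output differ, something is wrong with the algorithm, which makes this a handy check to apply to the failed first attempt as well.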


4.7 Proteins, Files, and Arrays 

So far we've been writing programs with DNA sequence data. Now we'll also include the 
equally important protein sequence data. Here's an overview of what is covered in the 
following sections: 

How to use protein sequence data in a Perl program 

How to read protein sequence data in from a file 


Arrays in the Perl language 


For the rest of the chapter, both protein and DNA sequence data are used. 


4.8 Reading Proteins in Files 


Programs interact with files on a computer disk. These files can be on hard disk, CD,
floppy disk, Zip drive, or magnetic tape: any kind of permanent storage.


Let's take a look at how to read protein sequence data from a file. First, create a file on
your computer (use your text editor) and put some protein sequence data into it. Call the
file NM_021964fragment.pep (you can download it from this book's web site). You
will be using the following data (part of the human zinc finger protein NM_021964):

MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
GLQYALNVPISVKQEITETDVSEQLMRDKKQIR











You can use any name, except one that's already in use in the same folder. 


Just as well-chosen variable names can be critical to understanding a program, well- 
chosen file and folder names can also be critical. If you have a project that generates lots 
of computer files, you need to carefully consider how to name and organize the files and 
folders. This is as true for individual researchers as for large, multi-national teams. It's 
important to put some effort into assigning informative names to files. 


The filename NM_021964fragment.pep is taken from the GenBank ID of the record 
where this protein is found. It also indicates the fragmentary nature of the data and 
contains the filename extension .pep to remind you that the file contains peptide or 
protein sequence data. Of course, some other scheme might work better for you; the point 
is to get some idea of what's in the file without having to look into it. 


Now that you've created or downloaded a file with protein sequence data in it, let's 
develop a program that reads the protein sequence data from the file and stores it into a 
variable. Example 4-5 shows a first attempt, which will be added to as we progress. 


Example 4-5. Reading protein sequence data from a file 


#!/usr/bin/perl -w
# Reading protein sequence data from a file

# The filename of the file containing the protein sequence data
$proteinfilename = 'NM_021964fragment.pep';

# First we have to "open" the file, and associate
# a "filehandle" with it. We choose the filehandle
# PROTEINFILE for readability.
open(PROTEINFILE, $proteinfilename);

# Now we do the actual reading of the protein sequence data from the file,
# by using the angle brackets < and > to get the input from the
# filehandle. We store the data into our variable $protein.
$protein = <PROTEINFILE>;

# Now that we've got our data, we can close the file.
close PROTEINFILE;

# Print the protein onto the screen
print "Here is the protein:\n\n";

print $protein;

exit;

Here's the output of Example 4-5: 

Here is the protein: 

MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD

Notice that only the first line of the file prints out. I'll show why in a moment. 

Let's look at Example 4-5 in more detail. After putting a filename into the variable
$proteinfilename, the file is opened with the following statement:

open(PROTEINFILE, $proteinfilename);





After opening the file, you can do various things with it, such as reading, writing,
searching, going to a specific location in the file, erasing everything in the file, and so on.
Notice that the program assumes the file named in the variable $proteinfilename
exists and can be opened. You'll see in a little bit how to check for that, but here's
something to try: change the filename in $proteinfilename so that
there's no file of that name on your computer, and then run the program. You'll get some
error messages if the file doesn't exist.


If you look at the documentation for the open function, you'll see many options. Mostly, 
they enable you to specify exactly what the file will be used for after it's opened. 


Let's examine the term PROTEINFILE, which is called a filehandle. With 
filehandles, it's not important to understand what they really are. They're just things you 
use when you're dealing with files. They don't have to have capital letters, but it's a 
widely followed convention. After the open statement assigns a filehandle, all the
interaction with a file is done by naming the filehandle. 


The data is actually read in to the program with the statement: 


$protein = <PROTEINFILE>;

Why is the filehandle PROTEINFILE enclosed within angle brackets? These angle
brackets are called input operators; a filehandle within angle brackets is how you bring in
data from some source outside the program. Here, we're reading the file called
NM_021964fragment.pep whose name is stored in variable
$proteinfilename, and which has a filehandle associated with it by the open
statement. The data is being stored in the variable $protein and then printed out.


However, as you've already noticed, only the first line of this multiline file is printed out.
Why? Because there are a few more things to learn about reading in files.


There are several ways to read in a whole file. Example 4-6 shows one way. 
Example 4-6. Reading protein sequence data from a file, take 2 


#!/usr/bin/perl -w
# Reading protein sequence data from a file, take 2

# The filename of the file containing the protein sequence data
$proteinfilename = 'NM_021964fragment.pep';

# First we have to "open" the file, and associate
# a "filehandle" with it. We choose the filehandle
# PROTEINFILE for readability.
open(PROTEINFILE, $proteinfilename);

# Now we do the actual reading of the protein sequence data from the file,
# by using the angle brackets < and > to get the input from the
# filehandle. We store the data into our variable $protein.
#
# Since the file has three lines, and since the read only is
# returning one line, we'll read a line and print it, three times.

# First line
$protein = <PROTEINFILE>;

# Print the protein onto the screen
print "\nHere is the first line of the protein file:\n\n";

print $protein;

# Second line
$protein = <PROTEINFILE>;

# Print the protein onto the screen
print "\nHere is the second line of the protein file:\n\n";

print $protein;

# Third line
$protein = <PROTEINFILE>;

# Print the protein onto the screen
print "\nHere is the third line of the protein file:\n\n";

print $protein;

# Now that we've got our data, we can close the file.
close PROTEINFILE;

exit;

Here's the output of Example 4-6: 

Here is the first line of the protein file:

MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD

Here is the second line of the protein file:

SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ

Here is the third line of the protein file:

GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR


The interesting thing about this program is that it shows how reading from a file works.
Every time you read into a scalar variable such as $protein, the next line of the file is
read. Something is remembering where the previous read was and is picking it up from
there.
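
The "something" that remembers is the filehandle, which keeps track of a current position in the file. As a rough sketch (the scratch file demo_position.txt and its two short lines are invented for illustration), Perl's built-in tell function reports that position, so you can watch it advance after each read:

```perl
use strict;
use warnings;

# Hypothetical scratch file with two short lines, for illustration only.
my $filename = 'demo_position.txt';
open(my $out, '>', $filename) or die "Cannot write $filename: $!";
print $out "MNIDD\nKLEGL\n";
close $out;

open(my $in, '<', $filename) or die "Cannot read $filename: $!";

my $start        = tell($in);   # position before any read
my $first        = <$in>;       # reads the first line
my $after_first  = tell($in);   # position has moved past line one
my $second       = <$in>;       # picks up where the last read stopped
my $after_second = tell($in);   # position has moved past line two

close $in;
unlink $filename;

print "Positions: $start, $after_first, $after_second\n";
print "Second read returned: $second";
```

The exact byte counts depend on how your operating system stores line endings, so the sketch prints them rather than assuming particular values.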


On the other hand, the drawbacks of this program are obvious. Having to write a few
lines of code for each line of an input file isn't convenient. However, there are two Perl
features that can handle this nicely: arrays (in the next section) and loops (in Chapter
5).


4.9 Arrays 


In computer languages an array is a variable that stores multiple scalar values. The values 
can be numbers, strings, or, in this case, lines of an input file of protein sequence data. 
Let's examine how they can be used. Example 4-7 shows how to use an array to read 
all the lines of an input file. 


Example 4-7. Reading protein sequence data from a file, take 3 


#!/usr/bin/perl -w
# Reading protein sequence data from a file, take 3

# The filename of the file containing the protein sequence data
$proteinfilename = 'NM_021964fragment.pep';

# First we have to "open" the file
open(PROTEINFILE, $proteinfilename);

# Read the protein sequence data from the file, and store it
# into the array variable @protein
@protein = <PROTEINFILE>;

# Print the protein onto the screen
print @protein;

# Close the file.
close PROTEINFILE;

exit;

Here's the output of Example 4-7: 

MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR

which, as you can see, is exactly the data that's in the file. Success! 


The convenience of this is clear—just one line to read all the data into the program. 


Notice that the array variable starts with an at sign (@) rather than the dollar sign ($) 
scalar variables begin with. Also notice that the print function can handle arrays as 
well as scalar variables. Arrays are used a lot in Perl, so you will see plenty of array
examples as the book continues.


An array is a variable that can hold many scalar values. Each item or element is a scalar 
value that can be referenced by giving its position in the array (its subscript or offset). 
Let's look at some examples of arrays and their most common operations. We'll define an 
array @bases that holds the four bases A, C, G, and T. Then we'll apply some of the
most common array operators.

Here's a piece of code that demonstrates how to initialize an array and how to use
subscripts to access the individual elements of an array:

# Here's one way to declare an array, initialized with a
# list of four scalar values.
@bases = ('A', 'C', 'G', 'T');

# Now we'll print each element of the array
print "Here are the array elements:";
print "\nFirst element: ";
print $bases[0];
print "\nSecond element: ";
print $bases[1];
print "\nThird element: ";
print $bases[2];
print "\nFourth element: ";
print $bases[3];

This code snippet prints out: 


First element: A 
Second element: C 
Third element: G 
Fourth element: T 














You can print the elements one after another like this:

@bases = ('A', 'C', 'G', 'T');
print "\n\nHere are the array elements: ";
print @bases;





which produces the output: 


Here are the array elements: ACGT 

You can also print the elements separated by spaces (notice the double quotes in the
print statement):

@bases = ('A', 'C', 'G', 'T');
print "\n\nHere are the array elements: ";
print "@bases";


which produces the output: 


Here are the array elements: A C G T

You can take an element off the end of an array with pop:

@bases = ('A', 'C', 'G', 'T');
$base1 = pop @bases;
print "Here's the element removed from the end: ";
print $base1, "\n\n";
print "Here's the remaining array of bases: ";
print "@bases";





which produces the output: 


Here's the element removed from the end: T

Here's the remaining array of bases: A C G

You can take a base off of the beginning of the array with shift:

@bases = ('A', 'C', 'G', 'T');
$base2 = shift @bases;
print "Here's an element removed from the beginning: ";
print $base2, "\n\n";
print "Here's the remaining array of bases: ";
print "@bases";





which produces the output: 


Here's an element removed from the beginning: A

Here's the remaining array of bases: C G T

You can put an element at the beginning of the array with unshift:

@bases = ('A', 'C', 'G', 'T');
$base1 = pop @bases;
unshift (@bases, $base1);
print "Here's the element from the end put on the beginning: ";
print "@bases\n\n";

which produces the output: 


Here's the element from the end put on the beginning: T A C G

You can put an element on the end of the array with push:

@bases = ('A', 'C', 'G', 'T');
$base2 = shift @bases;
push (@bases, $base2);
print "Here's the element from the beginning put on the end: ";
print "@bases\n\n";

which produces the output: 


Here's the element from the beginning put on the end: C G T A


You can reverse the array: 


@bases = ('A', 'C', 'G', 'T'); 
@reverse = reverse @bases; 

print "Here's the array in reverse: "; 
print "@reverse\n\n"; 





which produces the output: 


Here's the array in reverse: T G C A

You can get the length of an array:

@bases = ('A', 'C', 'G', 'T');
print "Here's the length of the array: ";
print scalar @bases, "\n";


which produces the output: 


Here's the length of the array: 4 

Here's how to insert an element at an arbitrary place in an array using the Perl splice 
function: 

@bases = ('A', 'C', 'G', 'T');
splice ( @bases, 2, 0, 'X');
print "Here's the array with an element inserted after the 2nd element: ";
print "@bases\n";

which produces the output:

Here's the array with an element inserted after the 2nd element: A C X G T
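
The splice function can also remove or replace elements, because its third argument is the number of elements to take out at the given offset. Here is a brief sketch; the variable names @removed and @replaced are chosen here for illustration only.

```perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'T');

# Remove one element starting at offset 1; splice returns what it removed.
my @removed = splice(@bases, 1, 1);
print "Removed: @removed\n";      # Removed: C
print "Remaining: @bases\n";      # Remaining: A G T

# Replace two elements starting at offset 1 with a single new element.
my @replaced = splice(@bases, 1, 2, 'N');
print "Replaced: @replaced\n";    # Replaced: G T
print "Now: @bases\n";            # Now: A N
```

Remember that offsets count from zero, so offset 1 refers to the second element of the array.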


4.10 Scalar and List Context 


Many Perl operations behave differently depending on the context in which they are used. 
Perl has scalar context and list context; both are listed in Example 4-8. 


Example 4-8. Scalar context and list context 


#!/usr/bin/perl -w 
# Demonstration of "scalar context" and "list context" 


@bases = ('A', 'C', 'G', 'T'); 
print "@bases\n"; 

$a = @bases;

print $a, "\n";

($a) = @bases;

print $a, "\n";

exit;

Here's the output of Example 4-8: 
A C G T
4
A

First, Example 4-8 declares an array of the four bases. Then the assignment statement
tries to assign an array (which is a kind of list) to a scalar variable $a:

$a = @bases;

In this kind of scalar context, an array evaluates to the size of the array, that is, the
number of elements in the array. The scalar context is supplied by the scalar variable on
the left side of the statement.

Next, Example 4-8 tries to assign an array (to repeat, a kind of list) to another list, in
this case, having just one variable, $a:

($a) = @bases;

In this kind of list context, an array evaluates to a list of its elements. The list context is
supplied by the list in parentheses on the left side of the statement. If there aren't enough
variables on the left side to assign to, only part of the array gets assigned to variables.
This behavior of Perl pops up in many situations; by design, many features of Perl behave
differently depending on whether they are in scalar or list context. See Appendix B for
more about scalar and list context.
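
A few more cases may help make the distinction concrete. This is a small sketch with variable names ($first, $second, $count) picked for illustration; it shows a two-variable list assignment, the scalar function, and a numeric comparison supplying scalar context.

```perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'T');

# List context with two variables: the first two elements are assigned,
# and the remaining elements are ignored.
my ($first, $second) = @bases;

# Scalar context: the array evaluates to its number of elements.
my $count = @bases;

print "First two: $first $second\n";

# The scalar function forces scalar context where a list is expected,
# such as in the argument list of print.
print "Count: ", scalar(@bases), "\n";

# A numeric comparison also supplies scalar context.
if (@bases > 3) {
    print "More than three bases\n";
}
```

Leaving off the parentheses in the first assignment would store the number 4 in $first instead of the letter A, which is an easy mistake to make.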


Now you've seen the use of strings and arrays to hold sequence and file data, and learned 
the basic syntax of Perl, including variables, assignment, printing, and reading files. 
You've transcribed DNA to RNA and calculated the reverse complement of a strand of 
DNA. By the end of Chapter 5, you'll have covered the essentials of Perl programming. 


4.11 Exercises 


Exercise 4.1 


Explore the sensitivity of programming languages to errors of syntax. Try 
removing the semicolon from the end of any statement of one of our working 
programs and examining the error messages that result, if any. Try changing other 
syntactical items: add a parenthesis or a curly brace; misspell "print" or some 
other reserved word; just type in, or delete, anything. Programmers get used to 
seeing such errors; even after getting to know the language well, it is still 
common to have some syntax errors as you gradually add code to a program. 
Notice how one error can lead to many lines of error reporting. Is Perl accurately 
reporting the line where the error is? 


Exercise 4.2

Write a program that stores an integer in a variable and then prints it out.

Exercise 4.3

Write a program that prints DNA (which could be in upper- or lowercase
originally) in lowercase (acgt); write another that prints the DNA in uppercase
(ACGT). Use the function tr///.

Exercise 4.4

Do the same thing as Exercise 4.3, but use the string directives \U and \L for
upper- and lowercase. For instance, print "\U$DNA" prints the data in $DNA in
uppercase.


Exercise 4.5 


Sometimes information flows from RNA to DNA. Write a program to reverse 
transcribe RNA to DNA. 


Exercise 4.6 


Read two files of data, and print the contents of the first followed by the contents 
of the second. 


Exercise 4.7 


This is a more difficult exercise. Write a program to read a file, and then print its 
lines in reverse order, the last line first. Or you may want to look up the functions 
push, pop, shift, and unshift, and choose one or more of them to accomplish 
this exercise. You may want to look ahead to Chapter 5 so you can use a loop
in this program, but this may not be necessary depending on the approach you
take. Or, you may want to use reverse on an array of lines.


Chapter 5. Motifs and Loops


This chapter continues demonstrating the basics of the Perl language begun in Chapter 
4. By the end of the chapter, you will know how to: 


Search for motifs in DNA or protein

Interact with users at the keyboard

Write data to files

Use loops

Use basic regular expressions

Take different actions depending on the outcome of conditional tests

Examine sequence data in detail by operating on strings and arrays


These topics, in addition to what you learned in Chapter 4, will give you the skills 
necessary to begin to write useful bioinformatics programs; in this chapter, you will learn 
to write a program that looks for motifs in sequence data. 


5.1 Flow Control 


Flow control is the order in which the statements of a program are executed. A program 
executes from the first statement at the top of the program to the last statement at the 
bottom, in order, unless told to do otherwise. There are two ways to tell a program to do 
otherwise: conditional statements and loops. A conditional statement executes a group of 
statements only if the conditional test succeeds; otherwise, it just skips the group of 
statements. A loop repeats a group of statements until an associated test fails. 


5.1.1 Conditional Statements 


Let's take another look at the open statement. Recall that if you try to open a nonexistent 
file, you get error messages. You can test for the existence of a file explicitly, before 
trying to open it. In fact, such tests are among the most powerful features of computer 
languages. The if , if-else, and unless conditional statements are three such testing 
mechanisms in Perl. 


The main feature of these kinds of constructs is the testing for a conditional. A 
conditional evaluates to a true or false value. If the conditional is true, the 
statements following are executed; if the conditional is false, they are skipped (or vice 
versa). 


However, "What is truth?" It's a question that programming languages may answer in
slightly different ways.


This section contains a few examples that demonstrate some of Perl's conditionals. The 
true-false condition in each example is equality between two numbers. Notice that 
equality of numbers is represented by two equal signs ==, because the single equal sign = 
is already used for assignment to a variable. 


Confusion between = for assignment and == for numeric equality
is a frequent programming bug, so watch for it!
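
To see why this bug is so easy to miss, consider the following sketch (the variable names are invented for illustration). Both tests look similar, but the second one assigns instead of comparing, and because the assigned value 5 is true, its block always runs. With warnings enabled, Perl reports "Found = in conditional, should be ==" for the second test but still runs the program.

```perl
use strict;
use warnings;

my $count = 0;

# Correct: == compares, so this block is skipped because 0 is not 5.
my $compared = 0;
if ($count == 5) {
    $compared = 1;
}

# Buggy: = assigns 5 to $count, and the assigned value (5) is true,
# so this block runs no matter what $count held before.
my $assigned = 0;
if ($count = 5) {
    $assigned = 1;
}

print "Comparison block ran: $compared\n";
print "Assignment block ran: $assigned\n";
print "count is now $count\n";
```

Notice that the buggy test also silently changes $count, so the damage can show up far from the line that caused it.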





The following examples demonstrate whether the conditional will evaluate to true or 
false. You don't ordinarily have much use for such simple tests. Usually you test the 
values that have been read into variables or the return value of function calls—things you 
don't necessarily know beforehand. 


The if statement with a true conditional:

if( 1 == 1) {
    print "1 equals 1\n\n";
}


produces the output: 


1 equals 1 

The test is 1 == 1, or, in English, "Does 1 equal 1?" Since it does, the conditional 
evaluates to true, the statement associated with the if statement is executed, and a 
message is printed out. 


You can also just say: 


if( 1) {
    print "1 evaluates to true\n\n";
}

which produces the output:


1 evaluates to true

The if statement with a false conditional:

if( 1 == 0) {
    print "1 equals 0\n\n";
}

produces no output! The test is 1 == 0 or, in English, "Does 1 equal 0?" Since it doesn't,
the conditional evaluates to false, the statements associated with the if statement
aren't executed, and no message is printed out.

You can also just say:

if( 0) {
    print "0 evaluates to true\n\n";
}


which produces no output, since 0 evaluates to false, so the statements associated with 
the if statement are skipped entirely. 


There's another way to write short if statements that mirrors how the English language 
works. In English, you can say, equivalently, "If you build it, they will come" or "They 
will come if you build it." Not to be outdone, Perl also allows you to put the if after the 
action:

print "1 equals 1\n\n" if (1 == 1);

which does the same thing as the first example in this section and prints out:

1 equals 1

Now, let's look at an if-else statement with a true conditional:

if( 1 == 1) {
    print "1 equals 1\n\n";
} else {
    print "1 does not equal 1\n\n";
}

which produces the output:

1 equals 1

The if-else does one thing if the test evaluates to true and another if the test
evaluates to false. Here is if-else with a false conditional:

if( 1 == 0) {
    print "1 equals 0\n\n";
} else {
    print "1 does not equal 0\n\n";
}

which produces the output:

1 does not equal 0

The final example is unless—the opposite of if. It works like the English word 
"unless": e.g., "Unless you study Russian literature, you are ignorant of Chekov." If the 
conditional evaluates to true, no action is taken; if it evaluates to false, the associated 
statements are executed. If "you study Russian literature" is false, "you are ignorant of 
Chekov." 

unless( 1 == 0) {
    print "1 does not equal 0\n\n";
}

produces the output:

1 does not equal 0


5.1.1.1 Conditional tests and matching braces


Two more comments are in order about these statements and their conditional tests. 


First, there are several tests that can be used in the conditional part of these statements. In 
addition to numeric equality == as in the previous example, you can also test for 
inequality !=, greater than >, less than <, and more. 


Similarly, you can test for string equality using the eq operator: if two strings are the 
same, it's true. There are also file test operators that allow you to test if a file exists, is 
empty, if permissions are set a certain way, and so on (see Appendix B). One common 
test is just a variable name: if the variable contains zero, it's considered false; any other 
number evaluates to true. If the variable contains a nonempty string, it evaluates to 
true; the empty string, designated by "" or '', is false. 
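
The following sketch tries several values against Perl's truth rules. The helper truth is a hypothetical subroutine written for this illustration (subroutines are covered later in the book). It also shows one exception worth knowing: the one-character string '0' counts as false even though it is not empty, while a string such as '0.0' is true.

```perl
use strict;
use warnings;

# Hypothetical helper: report how Perl treats a value in a conditional.
sub truth {
    my ($value) = @_;
    if ($value) {
        return 'true';
    } else {
        return 'false';
    }
}

print "0 is ",                truth(0),      "\n";   # false
print "42 is ",               truth(42),     "\n";   # true
print "the empty string is ", truth(''),     "\n";   # false
print "'ACGT' is ",           truth('ACGT'), "\n";   # true
print "the string '0' is ",   truth('0'),    "\n";   # false
print "the string '0.0' is ", truth('0.0'),  "\n";   # true

# String comparison uses eq; numeric comparison uses ==.
print "Strings match\n" if 'MNIDDKL' eq 'MNIDDKL';

# -e is a file test operator: true if the named file exists.
# The filename is the one used in this chapter's examples.
my $filename = 'NM_021964fragment.pep';
if (-e $filename) {
    print "$filename exists\n";
} else {
    print "$filename does not exist\n";
}
```

Which message the last test prints depends on whether the sample file is in the directory where you run the program.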


Second, notice that the statements that follow the conditional are
enclosed within a matching pair of curly braces. These statements
within curly braces are called a block and arise frequently in Perl.[1]
Matching pairs of parentheses, brackets, or braces, i.e., (), [ ], <>, and { }, are common
programming features. Having the same number of left and right braces in the right
places is essential for a Perl program to run correctly.


[1] As something of an oddity, the last statement within a block doesn't need a semicolon after 
it. 


Matching braces are easy to lose track of, so don't be surprised if you miss some and get 
error messages when you try to run the program. This is a common syntax error; you 
have to go back and find the missing brace. As code gets more complex, it can be a 
challenge to figure out where the matching braces are wrong and how to fix them. Even if 
the braces are in the right place, it can be hard to figure out what statements are grouped 
together when you're reading code. You can avoid this problem by writing code that 
doesn't try to do too much on any one line and uses indentation to further highlight the 
blocks of code (see Section 5.2).[2]


[2] Some text editors help you find a matching brace (for instance, in the vi editor, hitting the
percent key % over a parenthesis bounces you to the matching parenthesis).


Back to the conditional statements. The if-else also has an if-elsif-else form, as in
Example 5-1. The conditionals, first the if and then the elsifs, are evaluated in turn,
and as soon as one evaluates to true, its block is executed, and the rest of the
conditionals are ignored. If none of the conditionals evaluates to true, the else block is
executed if there is one (it's optional).

Example 5-1. if-elsif-else


#!/usr/bin/perl -w
# if-elsif-else

$word = 'MNIDDKL';

# if-elsif-else conditionals
if ($word eq 'QSTVSGE') {
    print "QSTVSGE\n";
} elsif ($word eq 'MRQQDMISHDEL') {
    print "MRQQDMISHDEL\n";
} elsif ( $word eq 'MNIDDKL' ) {
    print "MNIDDKL--the magic word!\n";
} else {
    print "Is \"$word\" a peptide? This program is not sure.\n";
}

exit;

Notice the \" in the else block's print statement; it lets you print a double-quote sign
(") within a double-quoted string. The backslash character tells Perl to treat the 
following " as the sign itself and not interpret it as the marker for the end of the string. 
Also note the use of eq to check for equality between strings. 


Example 5-1 gives the output: 
MNIDDKL--the magic word! 


5.1.2 Loops 


A loop allows you to repeatedly execute a block of statements enclosed within matching 
curly braces. There are several ways to loop in Perl: while loops, for loops, foreach 
loops, and more. Example 5-2 (from Chapter 4) displays the while loop and how 
it's used while reading protein sequence data in from a file. 


Example 5-2. Reading protein sequence data from a file, take 4

#!/usr/bin/perl -w
# Reading protein sequence data from a file, take 4

# The filename of the file containing the protein sequence data
$proteinfilename = 'NM_021964fragment.pep';

# First we have to "open" the file, and in case the
# open fails, print an error message and exit the program.
unless ( open(PROTEINFILE, $proteinfilename) ) {

    print "Could not open file $proteinfilename!\n";
    exit;
}

# Read the protein sequence data from the file in a "while" loop,
# printing each line as it is read.
while( $protein = <PROTEINFILE> ) {

    print " ###### Here is the next line of the file:\n";

    print $protein;
}

# Close the file.
close PROTEINFILE;

exit;

Here's the output of Example 5-2:

 ###### Here is the next line of the file:
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
 ###### Here is the next line of the file:
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
 ###### Here is the next line of the file:
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR

In the while loop, notice how the variable $protein is assigned each time through the
loop to the next line of the file. In Perl, an assignment returns the value of the assignment.
Here, the test is whether the assignment succeeds in reading another line. If there is
another line to read in, the assignment occurs, the conditional is true, the new line is
stored in the variable $protein, and the block with the two print statements is
executed. If there are no more lines, the assignment is undefined, the conditional is
false, and the program skips the block with the two print statements, quits the
while loop, and continues to the following parts of the program (in this case, the close
and exit functions).
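
One detail is worth spelling out. When the condition of a while loop is just an assignment from the input operator, Perl actually tests whether the value read is defined, not whether it is true, so a line containing only 0 does not end the loop early. The sketch below writes that test out explicitly; the scratch file demo_loop.txt, its contents, and the counters are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical scratch file standing in for a sequence file.
my $filename = 'demo_loop.txt';
open(my $out, '>', $filename) or die "Cannot write $filename: $!";
print $out "MNIDDKLEGL\nFLKCGGIDEM\nQSSRTMVVMG\n";
close $out;

open(my $in, '<', $filename) or die "Cannot read $filename: $!";

my $line_count = 0;
my $residues   = 0;

# The explicit defined() spells out the test that Perl applies
# automatically to a plain "while ( $line = <FILEHANDLE> )".
while (defined(my $line = <$in>)) {
    chomp $line;                              # drop the trailing newline
    $line_count = $line_count + 1;
    $residues   = $residues + length($line);
}

close $in;
unlink $filename;

print "Read $line_count lines holding $residues residues\n";
```

Because the loop body handles one line at a time, the same code works whether the file has three lines or three million.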























5.1.2.1 open and unless

The open call is a system call, because to open a file, Perl must ask for the file from the
operating system. The operating system may be a version of Unix or Linux, a Microsoft
Windows version, one of the Apple Macintosh operating systems, and so on. Files are
managed by the operating system and can be accessed only by it.


It's a good habit to check for the success or failure of system calls, especially when 
opening files. If a system call fails, and you're not checking for it, your program will 
continue, perhaps attempting to read or write to a file you couldn't open in the first place. 
You should always check for failure and let the user of the program know right away 
when a file can't be opened. Often you may want to exit the program on failure or try to 
open a different file. 


In Example 5-2, the open system call is part of the test of the unless conditional.
unless is the opposite of if. Just as in English you can say "do the statements in the
block if the condition is true," you can also say the opposite, "do the statements in the
block unless the condition is true." The open system call gives you a true value if it
successfully opens the file; so here, in the conditional test of the unless statement, if the
open call fails, the statements in the block are performed, the program prints an error
message, and then exits.
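
An error message is more helpful when it says why the open failed. Perl's special variable $! holds the operating system's reason for the most recent failure. The sketch below is one way to use it; the filename no_such_file_for_demo.pep is assumed not to exist, and the variables $opened and $reason are introduced only for this illustration.

```perl
use strict;
use warnings;

# A filename that is assumed not to exist, to force a failure.
my $filename = 'no_such_file_for_demo.pep';

my $opened;
my $reason = '';

# $! holds the operating system's reason for the most recent failure.
unless ( open(my $fh, '<', $filename) ) {
    $opened = 0;
    $reason = "$!";
    print "Could not open file $filename: $reason\n";
} else {
    $opened = 1;
    close $fh;
}

# A common shorthand that prints the reason and stops the program:
#   open(my $fh, '<', $filename) or die "Could not open $filename: $!";
```

The wording of the reason comes from the operating system, so it differs between platforms.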


To sum up, conditionals and loops are simple ideas and not difficult to learn in Perl. They 
are among the most powerful features of programming languages. Conditionals allow you 
to tailor a program to several alternatives, and in that way, make decisions based on the 
type of input it gets. They are responsible for a large part of whatever artificial 
intelligence there is in a computer program. Loops harness the speed of the computer so 
that in a few lines of code, you can handle large amounts of input or continually iterate 
and refine a computation. 


5.2 Code Layout 


Once you start using loops and conditional statements, you need to think seriously about 
formatting. You have many options when formatting Perl code on the page. Compare 
these variant ways of formatting an if statement inside a while loop:

Format A

while ( $alive ) {
    if ( $needs_nutrients ) {
        print "Cell needs nutrients\n";
    }
}

Format B

while ( $alive )
{
    if ( $needs_nutrients )
    {
        print "Cell needs nutrients\n";
    }
}

Format C

while ( $alive )
  {
    if ( $needs_nutrients )
  {
    print "Cell needs nutrients\n";
  }
      }

Format D

while ($alive) {if ($needs_nutrients) {print "Cell needs nutrients\n";}}


These code fragments are equivalent as far as the Perl interpreter is concerned. That's 
because Perl doesn't rely on how the statements are laid out on the lines; Perl cares only 
about the correct order of the syntactical elements. Some elements need some whitespace 
(such as spaces, tabs, or newlines) between them to make them distinct, but in general, 
Perl doesn't restrict how you use whitespace to lay out your code. 


Formats A and B are common ways to lay out code. They both make the program 
structure clear to the human reading it. Notice how the statements that have a block 
associated with them—the while and if statements—line up the curly braces and indent 
the statements within the blocks. These layouts make clear the extent of the block 
associated with the statements. (This can be critical for long, complicated blocks.) The 
statements inside the blocks are indented, for which you normally use the Tab key or 
groups of four or eight spaces. (Many text editors allow you to insert spaces when you hit 
the Tab key, or you can instruct them to set the tab stops at four, eight, or whatever 
number of spaces.) The overall structure of the program becomes clearer this way; you 
can easily see which statements are grouped in a block and associated with a given loop 
or conditional. Personally, I prefer the layout in Format A, although I'm also perfectly 
happy with Format B. 


Format C is an example of badly formatted code. The flow control of the code isn't clear; 
for instance, it's hard to see if the print statement is in the block of the while statement. 


Format D demonstrates how hard it is to read code with essentially no formatting, even a 
simple fragment like this. 


The Perl style guide, available from the main Perl manual page or from the command line 
by typing: 


perldoc perlstyle 


has some recommendations and some suggestions for ways to write readable code.
However, they are not rules, and you may use your own judgment as to the formatting
practices that work best for you.


5.3 Finding Motifs 


One of the most common things we do in bioinformatics is to look for motifs, short
segments of DNA or protein that are of particular interest. They may be regulatory
elements of DNA or short stretches of protein that are known to be conserved across
many species. (The PROSITE web site at http://www.expasy.ch/prosite/ has
extensive information about protein motifs.)


The motifs you look for in biological sequences are usually not one specific sequence.
They may have several variants: for example, positions in which it doesn't matter which
base or residue is present. They may have variant lengths as well. They can often be
represented as regular expressions, which you'll see more of in the discussion following
Example 5-3, in Chapter 9, and elsewhere in the book.
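
As a small preview of how a regular expression can capture such variants, here is a sketch that searches the first line of this chapter's sample protein. The three patterns are made up for illustration and are not known biological motifs.

```perl
use strict;
use warnings;

# First line of the chapter's sample protein data.
my $protein = 'MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD';

# A character class [ST] accepts either S or T at that position.
my $has_variant = 0;
if ( $protein =~ /GQ[ST]T/ ) {
    $has_variant = 1;
    print "Found G, Q, then S or T, then T\n";
}

# A dot accepts any residue; {2,4} allows a run of two to four of them.
my $has_gap = 0;
if ( $protein =~ /KL.{2,4}FL/ ) {
    $has_gap = 1;
    print "Found KL and FL separated by 2 to 4 residues\n";
}

# A pattern that does not occur in this sequence.
my $has_missing = 0;
if ( $protein =~ /W[ST]W/ ) {
    $has_missing = 1;
}
print "W[ST]W was not found\n" unless $has_missing;
```

A single pattern such as KL.{2,4}FL stands for many specific sequences at once, which is what makes regular expressions a good fit for motifs.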


Perl has a handy set of features for finding things in strings. This, as much as anything, 
has made it a popular language for bioinformatics. Example 5-3 introduces this string- 
searching capability; it does something genuinely useful, and similar programs are used 
all the time in biology research. It does the following: 


Reads in protein sequence data from a file

Puts all the sequence data into one string for easy searching

Looks for motifs the user types in at the keyboard

Example 5-3. Searching for motifs


#!/usr/bin/perl -w
# Searching for motifs

# Ask the user for the filename of the file containing
# the protein sequence data, and collect it from the keyboard
print "Please type the filename of the protein sequence data: ";

$proteinfilename = <STDIN>;

# Remove the newline from the protein filename
chomp $proteinfilename;

# open the file, or exit
unless ( open(PROTEINFILE, $proteinfilename) ) {

    print "Cannot open file \"$proteinfilename\"\n\n";
    exit;
}

# Read the protein sequence data from the file, and store it
# into the array variable @protein
@protein = <PROTEINFILE>;

# Close the file - we've read all the data into @protein now.
close PROTEINFILE;

# Put the protein sequence data into a single string, as it's easier
# to search for a motif in a string than in an array of
# lines (what if the motif occurs over a line break?)
$protein = join( '', @protein);

# Remove whitespace
$protein =~ s/\s//g;

# In a loop, ask the user for a motif, search for the motif,
# and report if it was found.
# Exit if no motif is entered.
do {
    print "Enter a motif to search for: ";

    $motif = <STDIN>;

    # Remove the newline at the end of $motif

    chomp $motif;

    # Look for the motif

    if ( $protein =~ /$motif/ ) {

        print "I found it!\n\n";

    } else {

        print "I couldn\'t find it.\n\n";
    }

# exit on an empty user input
} until ( $motif =~ /^\s*$/ );

# exit the program
exit;

Here's some typical output from Example 5-3: 

Please type the filename of the protein sequence data: NM_021964fragment.pep
Enter a motif to search for: SVLQ
I found it!

Enter a motif to search for: jkl
I couldn't find it.

Enter a motif to search for: QDSV
I found it!

Enter a motif to search for: HERLPQGLQ
I found it!

Enter a motif to search for:
I couldn't find it.


As you see from the output, this program finds motifs that the user types in at the 
keyboard. With such a program, you no longer have to search manually through 
potentially huge amounts of data. The computer does the work and does it much faster 
and more accurately than a human. 


It'd be nice if this program reported not only that it found the motif but also at what position.
You'll see how this can be accomplished in Chapter 9. An exercise in that chapter 
challenges you to modify this program so that it reports the positions of the motifs. 


The following sections examine and discuss the parts of Example 5-3 that are new: 
Getting user input from the keyboard 

Joining lines of a file into a single scalar variable 

Regular expressions and character classes 

do-until loops 

Pattern matching 


5.3.1 Getting User Input from the Keyboard 


You first saw filehandles in Example 4-5. In Example 5-3 (as was true in 
Example 4-3), a filehandle and the angle bracket input operator are used to read in 
data from an opened file into an array, like so:

@protein = <PROTEINFILE>;

Perl uses the same syntax to get input that is typed by the user at the keyboard. In 
Example 5-3, a special filehandle called STDIN (short for standard input), is used for 
this purpose, as in this line that collects a filename from the user: 

$proteinfilename = <STDIN>;


So, a filehandle can be associated with a file; it can also be associated with the keyboard 
where the user types responses to questions the program asks. 


If the variable you're using to save the input is a scalar variable (one that starts with a
dollar sign $), as in this fragment, only one line is read, which is almost always what you
want in this case.


In Example 5-3, the user is requested to enter the filename of a file containing protein 
sequence data. After getting a filename in this fashion, there's one more step before you 
can open the file. When the user types in a filename and sends a newline by hitting the 
Enter key (also known as the Return key), the filename also gets a newline character at 
the end as it is stored in the variable. This newline is not part of the filename and has to 
be removed before the open system call will work. The Perl function chomp removes 
newlines (or its cousins linefeeds and carriage returns) from the end of a string. (The 
older function chop removes the last character, no matter what it is; this caused trouble, 
so chomp was introduced and is almost always preferred.) 


So this part of Perl requires a little bit extra: removing the newline 
from the input collected from the user at the keyboard. Try 
commenting out the chomp function, and you'll see that the open fails, because no 
filename has a newline at the end. (Operating systems have rules as to which characters 
are allowed in filenames.) 
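
The effect of chomp can be seen in isolation with a short sketch. It uses a hard-coded string in place of keyboard input, the filename sample.pep is made up, and it is written with use strict and my, which this chapter has not yet introduced:

```perl
use strict;
use warnings;

# Simulate what <STDIN> returns after the user types a name and presses Enter.
# The filename is hypothetical.
my $filename = "sample.pep\n";

my $length_before = length $filename;    # the count includes the trailing newline

# chomp removes the trailing newline and returns the number of characters removed
my $removed = chomp $filename;

my $length_after = length $filename;

print "Removed $removed character(s); the name is now \"$filename\"\n";
print "Length went from $length_before to $length_after\n";

# A second chomp finds no newline to remove, so the string is unchanged
my $removed_again = chomp $filename;
print "Second chomp removed $removed_again character(s)\n";
```

Because chomp removes only a trailing newline, it is safe to call even when you are not sure one is present.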


5.3.2 Turning Arrays into Scalars with join 


It's common to find protein sequence data broken into short segments of 80 or so 
characters each. The reason is simple: when data is printed out on paper or displayed on 
the screen, it needs to be broken up into lines that fit into the space. Having your data 
broken into segments, however, is inconvenient for your Perl program. What if you're 
searching for a motif that's split by a newline character? Your program won't find it. In 
fact, some of the motifs searched for in Example 5-3 are split by line breaks. In Perl 
you deal with this sort of segmented data with the Perl function join. In Example 5-3
join collapses an array @protein by combining all the lines of data into a single string
stored in a new scalar variable $protein:

$protein = join( '', @protein);


You specify a string to be placed between the elements of the array as they're joined. In 
this case, you specify the empty string to be placed between the lines of the input file. 
The empty string is represented with the pair of single quotes '' (double quotes "" also 
serve). 

Recall that in Example 4-2, I introduced several equivalent ways to concatenate two
fragments of DNA. The use of the join function is very similar. It takes the scalar values
that are the elements of the array and concatenates them into a single scalar value. Recall
the following statement from Example 4-2, which is one of the equivalent ways to
concatenate two strings:

$DNA3 = $DNA1 . $DNA2;

Another way to accomplish the same concatenation uses the join function:

$DNA3 = join( "", ($DNA1, $DNA2) );

In this version, instead of giving an array name, I specify a list of scalar elements:

($DNA1, $DNA2)
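
The behavior of join described above can be tried out with a small sketch. The sequence fragments are invented for illustration, and the variables are declared with my under use strict, which goes slightly beyond what the chapter has covered so far:

```perl
use strict;
use warnings;

# Two made-up lines of sequence data, each ending in a newline,
# as they would look after reading a file into an array
my @lines = ( "MNIDDKL\n", "EGLFLKC\n" );

# Join with the empty string: the lines are simply run together
my $joined = join( '', @lines );

# Join with a separator: the string goes between elements, not after the last
my $dashed = join( '-', 'ACG', 'TTA', 'GGC' );

# Concatenating two scalars with the dot operator and with join gives the same result
my $DNA1 = 'ACGG';
my $DNA2 = 'ATAG';
my $with_dot  = $DNA1 . $DNA2;
my $with_join = join( '', ( $DNA1, $DNA2 ) );

print "joined lines still contain newlines: ", ( $joined =~ /\n/ ? "yes" : "no" ), "\n";
print "$dashed\n";
print "dot and join agree: $with_join\n" if $with_dot eq $with_join;
```

Note that join does not remove the newlines already inside the elements, which is why Example 5-3 follows it with a whitespace-removing substitution.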


5.3.3 do-until Loops


There's a new kind of loop in Example 5-3, the do-until loop, which first executes 
a block and then does a conditional test. Sometimes this is more convenient than the 
usual order in which you test first, then do the block if the test succeeds. Here, you want 
to prompt the user, get the user's input, search for the motif, and report the results. Before 
doing it again, you check the conditional test to see if the user has input an empty line. 
This means that the user has no more motifs to look for, so you exit the loop. 
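
As a rough illustration of this order of events, the following sketch replaces keyboard input with a small array of made-up responses (an assumption made so the example runs without a user present) and counts how many times the block executes:

```perl
use strict;
use warnings;

# Stand-in for keyboard input: two made-up motifs, then an empty line
my @typed = ( "AAA\n", "jkl\n", "\n" );

my $rounds = 0;
my $motif;

do {
    # The block runs before the condition is ever tested
    $motif = shift @typed;
    chomp $motif;
    ++$rounds;
    print "Round $rounds: got \"$motif\"\n";

# The test comes last: stop once the input is empty or only whitespace
} until ( $motif =~ /^\s*$/ );

print "The block ran $rounds times\n";
```

The block runs for the empty line too, since the test happens only after the block has finished.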


5.3.4 Regular Expressions 


Regular expressions let you easily manipulate strings of all sorts, such as DNA and 
protein sequence data. What's great about regular expressions is that if there's something 
you want to do with a string, you usually can do it with Perl regular expressions. 


Some regular expressions are very simple. For instance, you can just use the exact text of 
what you're searching for as a regular expression: if I was looking for the word 
"bioinformatics" in the text of this book, I could use the regular expression: 


/bioinformatics/

Some regular expressions can be more complex, however. In this section, I'll explain
their use in Example 5-3.


5.3.4.1 Regular expressions and character classes 


Regular expressions are ways of matching one or more strings using special wildcard-like 
operators. Regular expressions can be as simple as a word, which matches the word itself, 
or they can be complex and made to match a large set of different words (or even every 
word!). 


After you join the protein sequence data into the scalar variable $protein in Example
5-3, you also need to remove newlines and anything else that's not sequence data. This
can include numbers on the lines, comments, informational or "header" lines, and so on.
In this case, you want to remove newlines and any spaces or tabs that might be invisibly
present. The following line of code in Example 5-3 removes this whitespace:

$protein =~ s/\s//g;


The sequence data in the scalar variable $protein is altered by this statement. You first 
saw the binding operator =~ and the substitute function s/// back in Example 4-3, 
where they were used to change one character into another. Here, they're used a little 
differently. You substitute any one of a set of whitespace characters, represented by \s,
with nothing, represented by the lack of anything between the second and third forward
slashes. In other words, you delete any of a set of whitespace characters, which is done
globally throughout the string by virtue of the g at the end of the statement.


The \s is one of several metasymbols. You've already seen the metasymbol \n. The \s
metasymbol matches any space, tab, newline, carriage return, or formfeed. \s can also be
written as:

[ \t\n\f\r]

This expression is an example of a character class and is enclosed in square brackets. A
character class matches one character, any one of the characters named within the square
brackets. A space is just typed as a space; other whitespace characters have their own
metasymbols: \t for tab, \n for newline, \f for formfeed, and \r for carriage return. A
carriage return causes the next character to be written at the beginning of the line, and a
linefeed advances to the next line; the two of them together amount to the same thing as
a newline character. A formfeed advances to the next page.
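
To see that the metasymbol and the character class behave alike on ordinary data, here is a small sketch. The sequence fragment is made up, and the code uses my declarations under use strict:

```perl
use strict;
use warnings;

# A made-up fragment with a space, a tab, a carriage return, and newlines mixed in
my $raw = "MNID DKL\n\tEGLF\r\nLKC\n";

# Remove whitespace using the \s metasymbol
my $with_metasymbol = $raw;
$with_metasymbol =~ s/\s//g;

# Remove whitespace using the explicit character class
my $with_class = $raw;
$with_class =~ s/[ \t\n\f\r]//g;

print "$with_metasymbol\n";
print "Both forms agree\n" if $with_metasymbol eq $with_class;
```

For this input the two substitutions give the same string. In newer versions of Perl, \s also covers a few additional characters, such as the vertical tab, so the two forms are not identical for every possible input.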


Each s/// command I've detailed has some kind of regular expression between the first 
two forward slashes /. You've seen single letters as the C in s/C/G/g in that position. The 
C is an example of a valid regular expression. 


There's another use of regular expressions in Example 5-3. The line of
code:

} until ( $motif =~ /^\s*$/ );

is, in English, testing for a blank line in the variable $motif. If the user input is nothing
except for perhaps some whitespace, represented as \s*, the match succeeds, and the
program exits. The whole regular expression is:

/^\s*$/

which translates as: match a string that, from the beginning (indicated by the ^), is zero
or more (indicated by the *) whitespace characters (indicated by the \s) until the end of
the string (indicated by the $).
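
A quick way to get a feel for this pattern is to run it against a handful of inputs. The sketch below does so with invented strings standing in for what a user might type (after chomp has removed the newline):

```perl
use strict;
use warnings;

# Inputs as they might look after chomp; all are made up for illustration
my @inputs = ( '', '   ', "\t", 'SVLQ', ' A ' );

my @blank;
foreach my $input (@inputs) {

    # ^ anchors at the start, \s* allows any amount of whitespace, $ anchors at the end
    if ( $input =~ /^\s*$/ ) {
        push @blank, 1;
        print "\"$input\" counts as blank\n";
    } else {
        push @blank, 0;
        print "\"$input\" is not blank\n";
    }
}
```

Only the first three inputs match: a single non-whitespace character anywhere in the string is enough to make the match fail.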


If this seems somewhat cryptic, just hang in there and you'll soon get familiar with the 
terminology. Regular expressions are a great way to manipulate sequence and other text- 
based data, and Perl is particularly good at making regular expressions relatively easy to 
use, powerful, and flexible. Many of the references in Appendix A contain material on
regular expressions, and there's a concise summary in Appendix B. 


5.3.4.2 Pattern matching with =~ and regular expressions 


The actual search for the motif happens in this line from Example 5-3:

if ( $protein =~ /$motif/ ) {

Here, the binding operator =~ searches for the regular expression stored as the value of
the variable $motif in the protein $protein. Using this feature, you can interpolate
the value of a variable into a string match. (Interpolation in Perl strings means inserting
the value of a variable into a string, as you first saw in Example 4-2 when you were
concatenating strings.) The actual motif, that is, the value of the string variable $motif,
is your regular expression. The simplest regular expressions are just strings of characters,
such as the motif AQQK, for example.


You can use Example 5-3 to play with some more features of regular expressions. You 
can type in any regular expression to search for in the protein. Try starting up the 
program, referring to the documentation on regular expressions, and play! Here are some 
examples of typing in regular expressions: 


Search for an A followed by a D or S, followed by a V:

Enter a motif to search for: A[DS]V
I couldn't find it.

Search for K, N, zero or more D's, and two or more E's (note that {2,} means
"two or more"):

Enter a motif to search for: KND*E{2,}
I found it!

Search for two E's, followed by anything, followed by another two E's:

Enter a motif to search for: EE.*EE
I found it!


In that last search, notice that a period stands for any character except a newline, and ".*" 
stands for zero or more such characters. (If you want to actually match a period, you have 
to escape it with a backslash.) 
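
The same kinds of patterns can be tried without the interactive prompt. The following sketch runs them against a short, made-up protein string (not the data file used for Example 5-3), so the results apply only to this invented sequence:

```perl
use strict;
use warnings;

# A short made-up protein string, used only for this illustration
my $protein = 'MKNDEEQMETHEELPQ';

# Patterns as a user might type them at the prompt
my @motifs = ( 'A[DS]V', 'KND*E{2,}', 'EE.*EE', 'ND{2}' );

my %found;
foreach my $motif (@motifs) {

    # The value of $motif is interpolated and treated as a regular expression
    $found{$motif} = ( $protein =~ /$motif/ ) ? 1 : 0;

    print "$motif: ", ( $found{$motif} ? "I found it!" : "I couldn't find it." ), "\n";
}
```

Here KND*E{2,} matches the stretch KNDEE, and EE.*EE matches from the first EE through the second; the other two patterns have nothing to match in this string.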


5.4 Counting Nucleotides 


There are many things you might want to know about a piece of DNA. Is it coding or 
noncoding? Does it contain a regulatory element? Is it related to some other known 
DNA, and if so, how? How many of each of the four nucleotides does the DNA contain? 
In fact, in some species the coding regions have a specific nucleotide bias, so this last 
question can be important in finding the genes. Also, different species have different 
patterns of nucleotide usage. So counting nucleotides can be interesting and useful.


[3] Coding DNA is DNA that codes for a protein, that is, it is part of a gene. In many organisms,
including humans, a large part of the DNA is noncoding: it is not part of genes and doesn't code
for proteins. In humans, about 98-99% of DNA is noncoding.


In the following sections are two programs, Examples 5-4 and 5-6, that make a count of 
each type of nucleotide in some DNA. They introduce a few new parts of Perl: 


"Exploding" a string 

Looking at specific locations in strings 
Iterating over an array 

Iterating over the length of a string 


To get the count of each type of nucleotide in some DNA, you have to look at each base, 
see what it is, and then keep four counts, one for each nucleotide. We'll do this in two 
ways: 


Explode the DNA into an array of single bases, and iterate over the array (that is, deal 
with the elements of the array one by one) 


Use the substr Perl function to iterate over the positions in the string of DNA while
counting


First, let's start with some pseudocode of the task. Afterwards, we'll make more detailed 
pseudocode, and finally write the Perl program for both approaches. 


The following pseudocode describes generally what is needed: 


for each base in the DNA

  if base is A
    count_of_A = count_of_A + 1
  if base is C
    count_of_C = count_of_C + 1
  if base is G
    count_of_G = count_of_G + 1
  if base is T
    count_of_T = count_of_T + 1
done

print count_of_A, count_of_C, count_of_G, count_of_T


As you can see, this is a pretty simple idea, mirroring what you'd do by hand if you had to. 
(If you want to count the relative frequencies of the bases in all human genes, you can't 
do it by hand—there are too many of them—and you have to use such a program. Thus 
bioinformatics.) Now let's see how it can be coded in Perl.


5.5 Exploding Strings into Arrays


Let's say you decide to explode the string of DNA into an array. By explode I mean 
separating out each letter in the string—sort of like blowing the string into bits. In other 
words, the letters representing the bases of the DNA in the string are separated, and each 
letter becomes its own scalar value in an array. Then you can look at the array elements 
(each of which is a single character) one by one, making the count as you go along. This 
is the inverse of the join function in Section 5.3.2, which takes an array of strings 
and makes a single scalar value out of them. (After exploding a string into an array, you 
could then join the array back into an identical string using join, if you so desire.) 


I'm also adding to this version of the pseudocode the instructions to get the DNA from a 
file and manipulate that file data until it's a single string of DNA sequence. So first, you 
join the data from the array of lines of the original file data, clean it up by removing 
whitespace until only sequence is left, and then explode it back into an array. But, of 
course, the point is that the last array has exactly what is needed, the data in a convenient 
form to use in the counting loop. Instead of an array of lines, with newlines and possibly 
other unwanted characters, there's an exact array of the individual bases. 


read in the DNA from a file

join the lines of the file into a single string $DNA

# make an array out of the bases of $DNA
@DNA = explode $DNA

# initialize the counts
count_of_A = 0
count_of_C = 0
count_of_G = 0
count_of_T = 0

for each base in @DNA

  if base is A
    count_of_A = count_of_A + 1
  if base is C
    count_of_C = count_of_C + 1
  if base is G
    count_of_G = count_of_G + 1
  if base is T
    count_of_T = count_of_T + 1
done

print count_of_A, count_of_C, count_of_G, count_of_T


As promised, this version of the pseudocode is a bit more detailed. It suggests a method 
to look at each of the bases by exploding the string of DNA into an array of single
characters. It also initializes the counts to zero to ensure they start off right. It's easier to 
see what's happening if you spell out the initialization in the program, and it can prevent 
certain kinds of errors from creeping into your code. (It's not a rule, however; sometimes, 
you may prefer to leave the values of variables undefined until they are used.) Perl 
assumes that an uninitialized variable has the value 0 if you try to use it as a number, for 
instance by adding another number to it. But you'll most likely get a warning if that is the 
case. 


Now that we have a design for the program, let's turn it into Perl code. Example 5-4 is a
workable program; you'll see other ways to accomplish the same task more quickly as 
you proceed in this chapter, but speed is not the main concern at this point. 


Example 5-4. Determining frequency of nucleotides 


#!/usr/bin/perl -w
# Determining frequency of nucleotides

# Get the name of the file with the DNA sequence data
print "Please type the filename of the DNA sequence data: ";

$dna_filename = <STDIN>;

# Remove the newline from the DNA filename
chomp $dna_filename;

# open the file, or exit
unless ( open(DNAFILE, $dna_filename) ) {

    print "Cannot open file \"$dna_filename\"\n\n";
    exit;
}

# Read the DNA sequence data from the file, and store it
# into the array variable @DNA
@DNA = <DNAFILE>;

# Close the file
close DNAFILE;

# From the lines of the DNA file,
# put the DNA sequence data into a single string.
$DNA = join( '', @DNA);

# Remove whitespace
$DNA =~ s/\s//g;

# Now explode the DNA into an array where each letter of the
# original string is now an element in the array.
# This will make it easy to look at each position.
# Notice that we're reusing the variable @DNA for this purpose.
@DNA = split( '', $DNA );

# Initialize the counts.
# Notice that we can use scalar variables to hold numbers.
$count_of_A = 0;
$count_of_C = 0;
$count_of_G = 0;
$count_of_T = 0;
$errors     = 0;

# In a loop, look at each base in turn, determine which of the
# four types of nucleotides it is, and increment the
# appropriate count.
foreach $base (@DNA) {

    if ( $base eq 'A' ) {
        ++$count_of_A;
    } elsif ( $base eq 'C' ) {
        ++$count_of_C;
    } elsif ( $base eq 'G' ) {
        ++$count_of_G;
    } elsif ( $base eq 'T' ) {
        ++$count_of_T;
    } else {
        print "!!!!!!!! Error - I don\'t recognize this base: $base\n";
        ++$errors;
    }
}

# print the results
print "A = $count_of_A\n";
print "C = $count_of_C\n";
print "G = $count_of_G\n";
print "T = $count_of_T\n";
print "errors = $errors\n";

# exit the program
exit;


To demonstrate Example 5-4, I have created the following small file of DNA and
called it small.dna:

AAAAAAAAAAAAAAGGGGGGGTTTTCCCCCCCC
CCCCCGTCGTAGTAAAGTATGCAGTAGCVG
CCCCCCCCCCGGGGGGGGAAAAAAAAAAAAAAATTTTTTAT
AAACG


The file small.dna can be typed into your computer using your favorite text editor, or you 
can download it from this book's web site. 


Notice that there is a V in the file, an error.[4] Here is the output of Example 5-4:

[4] Files of DNA sequence data sometimes include such characters as N, meaning "some
undetermined base," or other special characters. You sometimes have to look at the
documentation for the source, say an ABI sequencer or a GenBank file or whatever, to
discover which characters are used and what they mean.

Please type the filename of the DNA sequence data:
small.dna
!!!!!!!! Error - I don't recognize this base: V
A = 40
C = 27
G = 24
T = 17
errors = 1


Now let's look at the new stuff in this program. Opening and reading the sequence data is 
the same as previous programs. The first new thing is at this line: 


@DNA = split( '', $DNA );

which the comments say will explode the string $DNA into an array of single characters
@DNA.

split is the companion to join, and it's a good idea to take a little while to look over the
documentation for these two commands. Calling split with an empty string as the first
argument causes the string to explode into individual characters; that's just what we
want.[5]

[5] As you'll see in the documentation for the split function, the first argument can be any
regular expression, such as /\s+/ (one or more adjacent whitespace characters).
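
Both uses of split mentioned here, the empty separator and a regular expression separator, can be compared in a short sketch. The strings are invented for illustration, and the code uses my declarations under use strict:

```perl
use strict;
use warnings;

my $DNA = 'ACGT';    # made-up sequence

# An empty separator splits between every character
my @bases = split( '', $DNA );

# A regular expression separator: split on runs of whitespace
my @words = split( /\s+/, "AAA  CCC\tGGG" );

# join with an empty string reverses the first split
my $rejoined = join( '', @bases );

print scalar(@bases), " bases: @bases\n";
print scalar(@words), " words: @words\n";
print "round trip gives back $rejoined\n" if $rejoined eq $DNA;
```

The round trip at the end shows the sense in which split and join are companions.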


Next, there are five scalar variables initialized to 0, the variables $count_of_A and so
forth. Initializing means assigning an initial value, in this case, the value 0.


Example 5-4 illustrates the concepts of type and initialization. The type of a variable
determines what kind of data it can hold, for instance, strings or numbers. Up to now
we've been using scalar variables such as $DNA to store strings of letters such as A, C, G,
and T. Example 5-4 shows that you can also use scalar variables to store numbers. For
example, the variable $count_of_A keeps a running count of the character A.


Scalar variables can store integers (0, 1, -1, 2, -2, ...), decimal or floating-point numbers
such as 6.544, and numbers in scientific notation such as 6.544E6, which translates as
6.544 x 10^6, or 6,544,000. (See Appendix B for more details on types of numbers.)


In Example 5-4, the variables $count_of_A through $count_of_T are initialized
to 0. Initializing a variable means giving it a value after it's declared. If you don't 
initialize your variables, they assume the value of 'undef'. In Perl, an undefined 
variable is 0 if it is asked for in numerical context; it's an empty string if used in a string 
operation. Although Perl programmers often choose not to initialize variables, it's a 
critical step in many other languages. In C for instance, uninitialized variables have 
unpredictable values. This can wreak havoc with your output. You should get in the habit 
of initializing variables; it makes the program easier to read and maintain, and that's 
important. 
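
The two contexts can be observed directly. In the sketch below, the warning that Perl would normally issue for an uninitialized value is switched off inside one small block, purely so the demonstration runs quietly; that choice, and the variable names, are assumptions of this example rather than a recommended practice:

```perl
use strict;
use warnings;

my $uninitialized;    # declared but never assigned, so its value is undef

my ( $as_number, $as_string );
{
    # Switch off the "uninitialized value" warnings for this demonstration only
    no warnings 'uninitialized';

    $as_number = $uninitialized + 5;         # undef acts as 0
    $as_string = $uninitialized . 'ACGT';    # undef acts as the empty string
}

print "undef + 5 gives $as_number\n";
print "undef . 'ACGT' gives $as_string\n";
print "Is the variable defined? ", ( defined $uninitialized ? "yes" : "no" ), "\n";
```

With warnings left on, both operations would still produce the same values but would also report the use of an uninitialized value, which is usually a helpful hint that an initialization was forgotten.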


To declare a variable means to specify its name and other attributes such as an initial 
value and a scope (for scoping, see Chapter 6 and the discussion of my variables). 
Many languages require you to declare all variables before using them. For this book, up 
to now, declarations have been an unnecessary complication. The next chapter begins to 
require declarations. In Perl, you may declare a variable's scope (see Chapter 6 and the 
discussion of my variables) in addition to an initial value. Many languages also require 
you to declare the type of a variable, for example "integer," or "string," but Perl does not. 


Perl is written to be smart about what's in a scalar variable. For 
instance, you can assign the number 1234 (without quotes) to a variable, or 
you can assign the string '1234' (with quotes). Perl treats the variable as a string for 
printing, and as a number for using in arithmetic operations, without your having to 
worry about it. Example 5-5 demonstrates this ability. In other words, Perl isn't strict 
about specifying the type of data a variable is used for. 


Example 5-5. Demonstration of Perl's built-in knowledge about numbers and 
strings 


#!/usr/bin/perl -w
# Demonstration of Perl's built-in knowledge about numbers and strings

$num = 1234;
$str = '1234';

# print the variables
print $num, " ", $str, "\n";

# add the variables as numbers
$num_or_str = $num + $str;
print $num_or_str, "\n";

# concatenate the variables as strings
$num_or_str = $num . $str;
print $num_or_str, "\n";

exit;

Example 5-5 produces the output: 
1234 1234
2468
12341234





Example 5-5 illustrates the smart way Perl determines the datatype of a scalar variable, 
whether it's a string or a number, and whether you're trying to add or subtract it like a 
number or concatenate it like a string. Perl behaves accordingly, which makes your job as 
a programmer a little bit easier; Perl "does the right thing" for you. 


Next is a new kind of loop, the foreach loop. This loop works over the elements of an 
array. The line: 


foreach $base (@DNA) {

loops over the elements of the array @DNA, and each time through the loop, the scalar
variable $base (or whatever name you choose) is set to the next element of the array.


The body of the loop checks for each base and increments the count for that base if found. 
There are four ways to add 1 to a number in Perl. Here, you put a ++ in front of the 
variable, like this: 


++$count;

You can also put the ++ after the variable:

$count++;

You can spell it out like this, a combination of adding and assignment:

$count = $count + 1;

or, as a shorthand of that, you can say:

$count += 1;


Almost an embarrassment of riches. The plus-plus (++) notation is convenient for 
incrementing counts, as we're doing here. The plus-equals (+=) notation saves some 
typing and is very popular for adding other numbers besides 1. 
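
All four forms can be checked side by side. The sketch below applies each once to the same counter, and then shows the one situation where the prefix and postfix forms differ, namely when the value of the increment expression itself is used. The variable names are chosen for this illustration:

```perl
use strict;
use warnings;

my $count = 0;

++$count;               # prefix increment
$count++;               # postfix increment
$count = $count + 1;    # addition and assignment spelled out
$count += 1;            # plus-equals shorthand

print "count = $count\n";

# The prefix and postfix forms differ only when the expression's value is used
my $n      = 5;
my $before = $n++;    # $before gets the old value 5, then $n becomes 6
my $after  = ++$n;    # $n becomes 7 first, then $after gets 7

print "before = $before, after = $after, n = $n\n";
```

When the increment stands alone as a statement, as in the counting loops of this chapter, the prefix and postfix forms are interchangeable.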


The foreach loop in Example 5-4 could have been written like this:




foreach (@DNA) {

    if ( /A/ ) {
        ++$count_of_A;
    } elsif ( /C/ ) {
        ++$count_of_C;
    } elsif ( /G/ ) {
        ++$count_of_G;
    } elsif ( /T/ ) {
        ++$count_of_T;
    } else {
        print "!!!!!!!! Error - I don\'t recognize this base: ";
        print;
        print "\n";
        ++$errors;
    }
}


This version of the foreach loop:

foreach (@DNA) {

doesn't name a scalar variable. In a foreach loop, if you don't specify a scalar variable to
hold the scalars that are being read from the array ($base served that function in the
version of this loop in Example 5-4), Perl uses the special variable $_.


Furthermore, many Perl built-in functions operate on this special variable if no argument
is provided to them. Here, the conditional tests are simply patterns; Perl assumes you're
doing a pattern match on the $_ variable, so it behaves as if you had said $_ =~ /A/,
for instance. Finally, in the error message, the statement print; prints the value of the
$_ variable.


This special variable $_ that doesn't have to be named appears in many Perl 
programs, although I don't use it extensively in this book. 
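
For readers who want to see $_ at work on a smaller scale, here is a minimal sketch. The six-letter sequence is made up, and the loop counts only one base to keep the example short:

```perl
use strict;
use warnings;

my @DNA = split( '', 'ACGTTV' );    # made-up sequence containing one bad character

my $count_of_T = 0;
my @unknown;

foreach (@DNA) {

    # With no loop variable named, each element is placed in $_
    if ( /T/ ) {                   # same as: $_ =~ /T/
        ++$count_of_T;
    } elsif ( !/[ACG]/ ) {         # same as: !( $_ =~ /[ACG]/ )
        push @unknown, $_;
    }
}

print "T = $count_of_T\n";
print "unrecognized: @unknown\n";
```

The implicit variable saves typing, at some cost in readability for newcomers, which is one reason to name the loop variable explicitly while learning.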


5.6 Operating on Strings 


It's not necessary to explode a string into an array in order to look at each character. In 
fact, sometimes you'd want to avoid that. A large string takes up a large amount of 
memory in your computer. So does a large array. When you explode a string into an array, 
the original string is still there, and you also have to make a copy of each character for 
the elements of the new array you're creating. If you have a large string, that already uses 
a good portion of available memory, creating an additional array can cause you to run out 
of memory. When you run out of memory, your computer performs poorly; it can slow to 
a crawl, crash, or freeze ("hang"). These haven't been worrisome considerations up to 
now, but if you use large data sets (such as the human genome), you have to take these 
things into account. 

So let's say you'd like to avoid making a copy of the DNA sequence data into another
variable. Is there a way to just look at the string $DNA and count the bases from it
directly? Yes. Here's some pseudocode, followed by a Perl program:

read in the DNA from a file

join the lines of the file into a single string of $DNA

# initialize the counts
count_of_A = 0
count_of_C = 0
count_of_G = 0
count_of_T = 0

for each base at each position in $DNA

  if base is A
    count_of_A = count_of_A + 1
  if base is C
    count_of_C = count_of_C + 1
  if base is G
    count_of_G = count_of_G + 1
  if base is T
    count_of_T = count_of_T + 1
done

print count_of_A, count_of_C, count_of_G, count_of_T

Example 5-6 shows a program that examines each base in a string of DNA.


Example 5-6. Determining frequency of nucleotides, take 2 


#!/usr/bin/perl -w
# Determining frequency of nucleotides, take 2

# Get the DNA sequence data
print "Please type the filename of the DNA sequence data: ";

$dna_filename = <STDIN>;

chomp $dna_filename;

# Does the file exist?
unless ( -e $dna_filename) {

    print "File \"$dna_filename\" doesn\'t seem to exist!!\n";
    exit;
}

# Can we open the file?
unless ( open(DNAFILE, $dna_filename) ) {

    print "Cannot open file \"$dna_filename\"\n\n";
    exit;
}

@DNA = <DNAFILE>;

close DNAFILE;

$DNA = join( '', @DNA);

# Remove whitespace
$DNA =~ s/\s//g;

# Initialize the counts.
# Notice that we can use scalar variables to hold numbers.
$count_of_A = 0;
$count_of_C = 0;
$count_of_G = 0;
$count_of_T = 0;
$errors     = 0;

# In a loop, look at each base in turn, determine which of the
# four types of nucleotides it is, and increment the
# appropriate count.
for ( $position = 0 ; $position < length $DNA ; ++$position ) {

    $base = substr($DNA, $position, 1);

    if ( $base eq 'A' ) {
        ++$count_of_A;
    } elsif ( $base eq 'C' ) {
        ++$count_of_C;
    } elsif ( $base eq 'G' ) {
        ++$count_of_G;
    } elsif ( $base eq 'T' ) {
        ++$count_of_T;
    } else {
        print "!!!!!!!! Error - I don\'t recognize this base: $base\n";
        ++$errors;
    }
}

# print the results
print "A = $count_of_A\n";
print "C = $count_of_C\n";
print "G = $count_of_G\n";
print "T = $count_of_T\n";
print "errors = $errors\n";

# exit the program
exit;

Here's the output of Example 5-6: 

Please type the filename of the DNA sequence data:
small.dna
!!!!!!!! Error - I don't recognize this base: V
A = 40
C = 27
G = 24
T = 17
errors = 1

In Example 5-6, I added a line of code to see if the file exists:

unless ( -e $dna_filename) {


There are file test operators for several conditions; see Appendix B or Perl 
documentation under -X. Note that files have several attributes, such as size, permission, 
location in the filesystem, and type of file, and that many of these things can be tested for 
easily with the file test operators. 
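
A few of the file test operators can be tried on a throwaway file. The sketch below creates one with the core File::Temp module (an assumption of this example; Example 5-6 itself asks the user for a filename) and then applies -e, -s, -r, and -d to it:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small temporary file so the tests have something to examine
my ( $fh, $filename ) = tempfile( UNLINK => 1 );
print $fh "ACGT\n";
close $fh;

my $exists   = -e $filename ? 1 : 0;    # does the file exist?
my $size     = -s $filename;            # size in bytes (false if the file is empty)
my $readable = -r $filename ? 1 : 0;    # can this program read it?
my $is_dir   = -d $filename ? 1 : 0;    # is it a directory?

# A name that should not exist; the suffix is made up for this test
my $missing = -e "$filename.no_such_file" ? 1 : 0;

print "exists: $exists, size: $size bytes, readable: $readable, directory: $is_dir\n";
print "the made-up name exists: $missing\n";
```

Testing with -e before calling open, as Example 5-6 does, lets the program give a more specific error message than "Cannot open file".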


Notice, also, that I have kept the detailed comments about the regular expression, because 
regular expressions can be hard to read, and a little commenting here helps a reader to 
skim the code. 


Everything else is familiar, until you hit the for loop; it requires a little explanation: 

for ( $position = 0 ; $position < length $DNA ; ++$position ) {

    # the statements in the block
}

This for loop is the equivalent of this while loop:

$position = 0;

while( $position < length $DNA ) {

    # the same statements in the block, plus

    ++$position;
}

Take a moment and compare these two loops. You'll see the same statements but in
different locations. 


As you can see, the for loop brings the initialization and increment of a counter 
($position) into the loop statement, whereas in the while loop, they are separate
statements. In the for loop, both the initialization and the increment statement are placed 
between parentheses, whereas you find only the conditional test in the while loop. In the 
for loop, you can put initializations before the first semicolon and increment statements 
after the second semicolon. The initialization statement is done just once before starting 
the loop, and the increment statement is done at the end of each iteration through the 
block before going back to the conditional test. It's really just a shorthand for the 
equivalent while loop as just shown. 
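
The equivalence can be checked by running both forms over the same string and comparing how many times each block executes. This sketch uses a made-up six-letter sequence and my declarations under use strict:

```perl
use strict;
use warnings;

my $DNA = 'ACGTAC';    # made-up sequence

# Count the iterations of a for loop
my $for_steps = 0;
for ( my $i = 0 ; $i < length $DNA ; ++$i ) {
    ++$for_steps;
}

# Count the iterations of the equivalent while loop
my $while_steps = 0;
my $position    = 0;
while ( $position < length $DNA ) {
    ++$while_steps;
    ++$position;    # the increment moves to the end of the block
}

print "for loop: $for_steps iterations\n";
print "while loop: $while_steps iterations\n";
print "final position: $position\n";
```

Both loops run once per character, and the while loop finishes with its counter equal to the length of the string, which is the first value for which the test fails.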


The conditional test checks to see if the position reached in the string is less than the 
length of the string. It uses the length Perl function. Obviously, you don't want to check
characters beyond the length of the string. But a word is in order here about the 
numbering of positions in strings and arrays. 


By default, Perl assumes that a string begins at position 0 and its last character is at a
position that's numbered one less than the length of the string. Why do it this way instead
of numbering the positions from 1 up to and including the length of the string? There are
reasons, but they're somewhat abstruse; see the documentation for enlightenment. If it's
any comfort, many other programming languages make the same choice. (However,
many do it the intuitive way, starting at 1. Ah well.)


This way of numbering is important to biologists because they are used to numbering 
sequences beginning with 1, not with 0 the way Perl does it. You sometimes have to add 
1 to a position before printing out results so they'll make sense to nonprogrammers. It's 
mildly annoying, but you'll get used to it. 


The same holds true for numbering the elements of an array. The first element of an array 
is element 0; the last is element $length-1. 


Anyway, you see that the conditional test evaluates to true while the value of
$position is length-1 or less and fails when $position reaches the same value
as the length of the string. For example, say you have a string that contains the text
"seeing." This has a length of six characters. The "s" is at position 0, and the "g" is at
position 5, which is one less than the string length 6.
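The numbering can be checked with a few lines of Perl. This is only a sketch: the variable names are made up for the illustration, and the loop adds 1 to each position before printing, as suggested above for output meant for biologists.

```perl
use strict;
use warnings;

my $word = 'seeing';
my $last = length($word) - 1;    # 5, the highest valid position

my $first_char = substr($word, 0, 1);        # 's'
my $last_char  = substr($word, $last, 1);    # 'g'

# Report positions the way a biologist would count them, starting at 1.
for ( my $position = 0 ; $position <= $last ; ++$position ) {
    my $human_position = $position + 1;
    print "Character $human_position is ", substr($word, $position, 1), "\n";
}
```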


Back in the block, you call the substr function to look into the string:


$base = substr($DNA, $position, 1);

This is a fairly general-purpose function for working with strings; you can also insert and
delete things. Here, you look at just one character, so you call substr on the string $DNA,
ask it to look in position $position for one character, and save the result in scalar
variable $base. Then you proceed to accumulate the count as in the preceding version
of the program, Example 5-4.
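As a sketch of the "insert and delete" uses just mentioned, substr also accepts a fourth argument, a replacement string. The sequence below is made up, and copies are edited so that the original string stays intact.

```perl
use strict;
use warnings;

my $DNA = 'ACGTACGT';    # made-up sequence

# Look at one base without changing the string.
my $base = substr($DNA, 2, 1);           # 'G'

# Insert: replace zero characters at position 4 with 'TTT'.
my $inserted = $DNA;
substr($inserted, 4, 0, 'TTT');          # 'ACGTTTTACGT'

# Delete: replace two characters at position 0 with the empty string.
my $deleted = $DNA;
substr($deleted, 0, 2, '');              # 'GTACGT'

print "$base $inserted $deleted\n";
```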
5.7 Writing to Files


Example 5-7 shows one more way to count nucleotides in a string of DNA. It uses a 
Perl trick that was designed with exactly this kind of job in mind. It puts a global regular 
expression search in the test for a while loop, and as you'll see, it's a compact way of
counting characters in a string. 


One of the nice things about Perl is that if you need to do something fairly regularly, the 
language has probably got a relatively succinct way to do it. (The downside of this is that 
Perl has a lot of things about it to learn.) 


The results of Example 5-7, besides being printed to the screen, will also be written to 
a file. The code that accomplishes this writing to a file is as follows: 
# Also write the results to a file called "countbase"
$outputfile = "countbase";

unless ( open(COUNTBASE, ">$outputfile") ) {

    print "Cannot open file \"$outputfile\" to write to!!\n\n";
    exit;
}

print COUNTBASE "A=$a C=$c G=$g T=$t errors=$e\n";

close(COUNTBASE);

As you see, to write to a file, you do an open call, just as when reading from a file, but
with a difference: you prepend a greater-than sign > to the filename.[6] The filehandle
becomes a first argument to a print statement (but without a comma following it). This
makes the print statement direct its output into the file.


[6] In this case, if the file already exists, it's emptied out first. It's possible to specify several
other behaviors. As mentioned earlier, the Perl documentation has all the details of the open
function, which sets the options for reading from, and writing to, files as well as other actions.
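One of those other behaviors is appending. The sketch below contrasts > (empty the file, then write) with >> (add to the end). It uses the three-argument form of open with filehandles stored in my variables, a style that differs from the bareword filehandles used in this chapter. The filename countbase_demo.txt and the counts written to it are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical output file name, chosen only for this sketch.
my $outputfile = 'countbase_demo.txt';

# '>' truncates (or creates) the file before writing.
open(my $out, '>', $outputfile)
    or die "Cannot open file \"$outputfile\" to write to: $!\n";
print $out "A=40 C=27\n";      # filehandle first, no comma after it
close($out);

# '>>' appends to the end instead of emptying the file.
open(my $append, '>>', $outputfile)
    or die "Cannot open file \"$outputfile\" to append to: $!\n";
print $append "G=24 T=17\n";
close($append);

# Read the file back to confirm both lines are there.
open(my $in, '<', $outputfile)
    or die "Cannot open file \"$outputfile\" to read: $!\n";
my @lines = <$in>;
close($in);

print @lines;
```

If the second open used > instead of >>, only the G and T line would survive.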


Example 5-7 is the third version of the Perl program that examines each base in a 
string of DNA. 


Example 5-7. Determining frequency of nucleotides, take 3 


#!/usr/bin/perl -w
# Determining frequency of nucleotides, take 3

# Get the DNA sequence data
print "Please type the filename of the DNA sequence data: ";

$dna_filename = <STDIN>;

chomp $dna_filename;

# Does the file exist?
unless ( -e $dna_filename) {

    print "File \"$dna_filename\" doesn't seem to exist!!\n";
    exit;
}

# Can we open the file?
unless ( open(DNAFILE, $dna_filename) ) {

    print "Cannot open file \"$dna_filename\"\n\n";
    exit;
}

@DNA = <DNAFILE>;

close DNAFILE;

$DNA = join( '', @DNA);

# Remove whitespace
$DNA =~ s/\s//g;

# Initialize the counts.
# Notice that we can use scalar variables to hold numbers.
$a = 0; $c = 0; $g = 0; $t = 0; $e = 0;

# Use a regular expression "trick", and five while loops,
# to find the counts of the four bases plus errors
while($DNA =~ /a/ig){$a++}
while($DNA =~ /c/ig){$c++}
while($DNA =~ /g/ig){$g++}
while($DNA =~ /t/ig){$t++}
while($DNA =~ /[^acgt]/ig){$e++}

print "A=$a C=$c G=$g T=$t errors=$e\n";

# Also write the results to a file called "countbase"
$outputfile = "countbase";

unless ( open(COUNTBASE, ">$outputfile") ) {

    print "Cannot open file \"$outputfile\" to write to!!\n\n";
    exit;
}

print COUNTBASE "A=$a C=$c G=$g T=$t errors=$e\n";

close(COUNTBASE);

# exit the program
exit;

Example 5-7 looks like this when you run it: 

Please type the filename of the DNA sequence data: 
small.dna 

A=40 C=27 G=24 T=17 errors=1 
The output file countbase has the following contents after you run Example 5-7: 
A=40 C=27 G=24 T=17 errors=1 
The while loop:

while($dna =~ /a/ig){$a++}

has as its conditional test, within the parentheses, a string-matching expression:

$dna =~ /a/ig


This expression is looking for the regular expression /a/, that is, the letter a. Since it has
the i modifier, it's a case-insensitive match, which means it matches a or A. It also has
the g (global) modifier, which means match all the a's in the string. (Without the global
modifier, it just keeps returning true every time through the loop, if there is an "a" in
$dna.)


Now, this string-matching expression, in the context of a while loop, causes the while
loop to execute its block on every match of the regular expression. So, append the
one-statement block:


{$a++}


to increment the counter at each match of the regular expression; in other words, you're
counting all the a's.
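Here is a small, self-contained sketch of the same trick on a made-up string. It also records where each match ended by calling the built-in pos function, which goes a little beyond what Example 5-7 does. The variable names are assumptions for the illustration.

```perl
use strict;
use warnings;

my $dna = 'AaCGTNa';    # made-up sequence with one non-base, N

# Count the a's (either case) and remember where each match ended.
my $count_a = 0;
my @positions;
while ( $dna =~ /a/ig ) {
    $count_a++;
    push @positions, pos($dna);   # pos() is the offset just past the match
}

# Count everything that is not a, c, g, or t.
my $errors = 0;
while ( $dna =~ /[^acgt]/ig ) { $errors++ }

print "a=$count_a at positions @positions, errors=$errors\n";
```

Because each match here is a single character, the offset just past the match is the same number as the character's position counted from 1.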


One other point should be made about this third version of the program. You'll notice 
some of the statements have been changed and shortened this time around. Some 
variables have shorter names, some statements are lined up on one line, and the print 
statement at the end is more concise. These are just alternative ways of writing. As you 
program, you'll find yourself experimenting with different approaches: try some on for 
size. 


The way to count bases in this third version is flexible; for instance, it
allows you to count non-ACGT characters without specifying them
individually. In later chapters, you'll use those while loops to good effect.


However, there's an even faster way to count bases. You can use the tr transliteration
function from Chapter 4; it's faster, which is helpful if you have a lot of DNA to count:





$a = ($dna =~ tr/Aa//);
$c = ($dna =~ tr/Cc//);
$g = ($dna =~ tr/Gg//);
$t = ($dna =~ tr/Tt//);


The tr function returns the count of the specified characters it finds in the string, and if
the set of replacement characters is empty, it doesn't actually change the string. So it
makes a good character counter. Notice that with tr, you have to spell out the upper- and
lowercase letters. Also, because tr doesn't accept character classes, you can't write a
negated class such as [^acgt] to count nonbases. (The c modifier of tr, which complements
the list of characters, is one way around this; see the Perl documentation.) You could also say:

$basecount = ($dna =~ tr/ACGTacgt//);

$nonbase = (length $dna) - $basecount;


Either way, the program runs faster using tr than using the while loops of Example 5-7.
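The following sketch puts the tr counting idioms side by side on a made-up string. The last line uses tr's c (complement) modifier, which counts the characters that are not in the list; treat it as an optional alternative to the subtraction, and the variable names as assumptions.

```perl
use strict;
use warnings;

my $dna = 'AaCGTNa';    # made-up sequence with one non-base, N

# With an empty replacement list, tr only counts; $dna is left unchanged.
my $count_a   = ($dna =~ tr/Aa//);
my $basecount = ($dna =~ tr/ACGTacgt//);

# Non-bases by subtraction...
my $nonbase   = length($dna) - $basecount;

# ...or directly, by complementing the search list with the c modifier.
my $nonbase_c = ($dna =~ tr/ACGTacgt//c);

print "a=$count_a bases=$basecount nonbases=$nonbase ($nonbase_c)\n";
```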


You may find it a bit much to have three (really, four) versions of this base-counting 
program, especially since much of the code in each version is identical. The only part of 
the program that really changed was the part that did the counting of the bases. Wouldn't 
it have been convenient to have a way to just alter the part that counts the bases? In 
Chapter 6, you'll see how subroutines allow you to partition your programs in just such 
a way. 


5.8 Exercises 


Exercise 5.1 


Use a loop to write a nonhalting program. The conditional must always evaluate 
to true, every time through the loop. Note that some systems will catch that 
you're in an infinite loop and will stop the program automatically. You will stop 
your program differently, depending on which operating system you use. Ctrl-C 
works on Unix and Linux, a Windows MS-DOS command window, or a MacOS 
X shell window. 


Exercise 5.2 


Prompt the user to enter two (short) strings of DNA. Concatenate the two strings 
of DNA by appending the second to the first using the .= assignment operator. 
Print the two strings as concatenated, and then print the second string lined up 
over its copy at the end of the concatenated strings. For example, if the input 
strings are AAAA and TTTT, print: 
AAAATTTT
    TTTT


Exercise 5.3


Write a program that prints all the numbers from 1 to 100. Your program should
have far fewer than 100 lines of code!


Exercise 5.4 


Write a program to calculate the reverse complement of a strand of DNA. Do not 
use the s/// or the tr functions. Use the substr function, and examine each base
one at a time in the original while you build up the reverse complement. (Hint: 
you might find it easier to examine the original right to left, rather than left to 
right, although either is possible.) 


Exercise 5.5 


Write a program to report on the percentage of hydrophobic amino acids in a 
protein sequence. (To find which amino acids are hydrophobic, consult any 
introductory text on proteins, molecular biology, or cell biology. You will find 
information sources in Appendix A.) 


Exercise 5.6 


Write a program that checks if two strings given as arguments are reverse 
complements of each other. Use the Perl built-in functions split, pop, shift, and 
eq (eq is actually an operator).


Exercise 5.7 


Write a program to report how GC-rich some sequence is. (In other words, just 
give the percentage of G and C in the DNA.) 


Exercise 5.8 


Modify Example 5-3 to not only find motifs by regular expressions but to print 
out the motif that was found. For example, if you search, using regular 
expressions, for the motif EE.*EE, your program should print EETVKNDEE. 
You can use the special variable $&. After a successful pattern match, this special
variable is set to hold the part of the string that the pattern matched.


Exercise 5.9 


Write a program that switches two bases in a DNA string at specified positions. 
(Hint: you can use the Perl functions substr or slice.)


Exercise 5.10 


Write a program that writes a temporary file and then deletes it. The unlink 
function removes a file: just say, for example: 
unlink "tmpfile"; 

but also check to see if unlink is successful.


Chapter 6. Subroutines and Bugs


In this chapter you'll extend your basic knowledge in two directions: 
Subroutines 
Using the Perl debugger 


Subroutines are an important way to structure programs. You'll use them in Chapter 7, 
where you'll learn how to use randomization to simulate the mutation of DNA. The Perl 
debugger examines a program's behavior in "slow motion" and helps you find those 
pesky bugs. 


6.1 Subroutines 


Subroutines are an important way to organize a program and are used in all major 
programming languages. 


A subroutine wraps up a bit of code, gives the code a name, and provides a way to pass in 
some values for its calculations and then report back the results. The rest of the program 
can then use the subroutine's code just by calling its name, giving the needed values to 
pass in to the subroutine code and then collecting the results. This use or "invocation" of 
a subroutine is commonly referred to as calling the subroutine. You can think of a
subroutine as a program within a program; just as you run programs to get results, so 
your programs call subroutines to get results. Once you have a subroutine, you can use it 
in a program simply by knowing which values to pass in and what kind of values to 
expect it to pass out. 


6.1.1 Advantages of Subroutines 


Subroutines provide several benefits. They endow programs with abstraction, 
modularization, and the ability to create large programs by organizing the code into 
manageable chunks with defined inputs and outputs. 


Say you need to calculate something, for instance the mean of a distribution at several 
places in a program or in several different programs. By writing this calculation as a 
subroutine, you can write it once, and then call it whenever you need it, thus making your 
program: 


Shorter, since you're reusing the code. 
Easier to test, since you can test the subroutine separately. 
Easier to understand, since it reduces clutter and better organizes programs. 


More reliable, since you have less code when you reuse subroutines, so there are fewer
opportunities for something to go wrong.

Faster to write, since you may, for example, have already written some subroutines that 
handle basic statistics and can just call the one that calculates the mean without having to 
write it again. Or better yet, you found a good statistics library someone else wrote, and 
you never had to write it at all. 
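As a sketch of the mean example, here is one way such a subroutine might look. The name mean and its behavior on an empty list are assumptions made for this illustration, and the syntax is explained in the sections that follow.

```perl
use strict;
use warnings;

# Hypothetical helper: arithmetic mean of a list of numbers.
sub mean {
    my(@values) = @_;
    return unless @values;    # empty list: return nothing rather than divide by zero

    my $sum = 0;
    foreach my $value (@values) {
        $sum += $value;
    }
    return $sum / scalar(@values);
}

# Written once, called wherever a mean is needed.
my $mean_length = mean(120, 80, 100);
my $mean_gc     = mean(0.5, 0.25);

print "$mean_length $mean_gc\n";
```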


There is another subtle, yet powerful idea at work here. Subroutines can themselves call 
other subroutines, that is, a subroutine can use another subroutine for help in its 
calculations.[1] By writing a set of subroutines, each of which does one or a few things
well, you can combine them in various ways to make new subroutines. You can then 
combine the new subroutines, and so on, and the end result can be large and flexible 
programming systems. Decomposing problems into sets of subroutines that can be 
conveniently combined allows you to create environments that can grow and adapt to 
changing conditions with a minimum of effort. 


[1] Subroutines can even call themselves, and this so-called recursion can be an elegant way to
compute (see Chapter 11).
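To illustrate subroutines built from other subroutines, the sketch below assembles a reverse complement from two smaller pieces. All three subroutine names are hypothetical and chosen for this example.

```perl
use strict;
use warnings;

# Each small subroutine does one thing.
sub complement {
    my($dna) = @_;
    $dna =~ tr/ACGTacgt/TGCAtgca/;
    return $dna;
}

sub reverse_string {
    my($string) = @_;
    return scalar reverse $string;
}

# A new subroutine made by combining the two above.
sub reverse_complement {
    my($dna) = @_;
    return reverse_string( complement($dna) );
}

print reverse_complement('AACG'), "\n";
```

If a better way to complement a base comes along, only complement needs to change; reverse_complement and anything built on it benefit automatically.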


The trick of all this is in how you partition the code into subroutines. You want 
subroutines that encapsulate something that will be generally useful, and not just called 
once (although that sometimes can be useful too). There are various rules of thumb: a 
subroutine should do one thing well, and it should be no more than a page or two of code. 
These are not real rules, and exceptions are frequent, but they can help you divide your 
code into manageable chunks, suitable for subroutines. 


6.1.2 Writing Subroutines 


Let's look at how subroutines are used and then at how they're defined. 


To use a subroutine, you pass data into the subroutine as arguments, and then you collect 
the return value(s) of the subroutine. For example, say you want a subroutine that, given 
some DNA, appends "ACGT" to the end of the DNA and returns the new, longer DNA. 
Let's call the subroutine addACGT. In Perl, you usually call a subroutine by typing its
name, followed by a parenthesized list of arguments (if any). For example, here's a call to
addACGT with the one argument $dna:

addACGT($dna);

When calling a subroutine, older versions of Perl required starting the name of a
subroutine with the & (ampersand) character. It's still okay to do so (e.g., &addACGT),
but these days the ampersand is usually omitted.[2]


[2] There are times, even in the newer versions of Perl, when an ampersand is required; you'll
see one such case in Chapter 11, in Section 11.2.3, which describes the File::Find module.
(See also the defined and undef functions in the documentation or the perlref manpage.)
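A brief sketch of both calling styles, plus the use of the ampersand with defined that the footnote alludes to. The subroutine body is a shortened stand-in for Example 6-1, and no_such_sub is a deliberately undefined, hypothetical name.

```perl
use strict;
use warnings;

sub addACGT {
    my($dna) = @_;
    return $dna . 'ACGT';
}

my $plain     = addACGT('CG');      # usual modern style
my $ampersand = &addACGT('CG');     # older style, still accepted

# One place the ampersand is still needed: asking whether a subroutine exists.
my $exists  = defined(&addACGT)     ? 'yes' : 'no';
my $missing = defined(&no_such_sub) ? 'yes' : 'no';

print "$plain $ampersand exists=$exists missing=$missing\n";
```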





Example 6-1 demonstrates a subroutine that shows in detail how this works. 


Example 6-1. A subroutine to append ACGT to DNA

#!/usr/bin/perl -w
# A program with a subroutine to append ACGT to DNA

# The original DNA
$dna = 'CGACGTCTTCTCAGGCGA';

# The call to the subroutine "addACGT".
# The argument being passed in is $dna; the result is saved in $longer_dna
$longer_dna = addACGT($dna);

print "I added ACGT to $dna and got $longer_dna\n\n";

exit;

################################################################################
# Subroutines for Example 6-1
################################################################################

# Here is the definition for subroutine "addACGT"

sub addACGT {
    my($dna) = @_;

    $dna .= 'ACGT';
    return $dna;
}

Example 6-1 produces the following output:
I added ACGT to CGACGTCTTCTCAGGCGA and got CGACGTCTTCTCAGGCGAACGT


We'll now look at this code to see how subroutines are defined and used in a Perl 
program. 


The first thing to notice, taking the large view, is that the program now has two sections. 
The first section starts from the beginning of the program and ends with the exit 
command. Following that (and announced by a blizzard of comments for easy reading) is 
a section for subroutine definitions, in this case, only the one definition for subroutine 
addACGT. It is common to place all subroutine definitions together at the end of a 
program, for ease in reading. Usually they're listed alphabetically or in some other 
convenient way. 


Actually, it is legal to put the subroutine definitions almost anywhere in a program. This
is because Perl first scans through the code and does things like check the syntax and
learn subroutine definitions, before it starts to run the program. In particular, subroutine
definitions can come after the point in the code where you use them (not necessarily
before, which many people assume is the rule), and they don't have to be grouped
together but can be scattered throughout the code. But our method of collecting them
together at the end can make reading a program much easier. The possible exception is
when a small subroutine is used in one section of code, as sometimes happens with the
sort function, for instance. In this case having the definition right there can save the
reader paging back and forth between the subroutine definition and its use. Usually, it's
more convenient to read the program without the subroutine definitions, to get the overall
flow of the program first, and then go back and look into the subroutines, if necessary.
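A minimal sketch of a call that appears before the definition it uses. The subroutine name sequence_length is hypothetical.

```perl
use strict;
use warnings;

# The call comes first...
my $length = sequence_length('ACGTAC');
print "Length is $length\n";

# ...and the definition follows. Perl compiles the whole file before
# running it, so the subroutine is already known when the call executes.
sub sequence_length {
    my($dna) = @_;
    return length $dna;
}
```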


As you see, Example 6-1 is very simple. It first stores some DNA into the variable
$dna and then passes that variable as an argument to the subroutine call, which looks
like this: addACGT($dna). The subroutine is called by its name, followed by
parentheses containing the arguments to the subroutine. There may be no arguments, or if
more than one, they are separated by commas. The value returned by the subroutine can
be saved; in this program the value is saved in a variable called $longer_dna, which is
then printed, and the program exits.


The part of the program from the beginning to the exit statement is called variously the 
main program or the main body of the program. By looking over this section of the code, 
you can see what happens from the beginning to the end of the program without looking 
into the details of the subroutines. 


Now that you've looked over the main program of Example 6-1, it's time
to look at the subroutine definition and how it uses the principle of scoping.


6.2 Scoping and Subroutines 


A subroutine is defined by the reserved word[3] for subroutine definitions, sub; the
subroutine's name, in this case, addACGT; and a block, enclosed in a pair of matching
curly braces. This is the same kind of block seen earlier in loops and conditional
statements that groups statements together.


[3] A reserved word is a fundamental, defined word in the Perl language, such as if, while,
foreach, or sub.


In Example 6-1, the name of the subroutine is addACGT, and the block is everything 
after the name. Here is the subroutine definition again: 
sub addACGT {
    my($dna) = @_;

    $dna .= 'ACGT';
    return $dna;
}


Now let's look into the block of the subroutine.


A subroutine is like a separate helper program for the main program, and it needs to have
its own variables. You will use two types of variables in your subroutines in this book:[4]


[4] In the subroutines in this book, we won't use global variables, which can be seen by both
the main program and the subroutines; nor will we use variables declared with local, which
provides a different kind of scoping restriction than my.


Arguments passed in to the subroutine 
Other variables declared with my and restricted to the scope of the subroutine 


Arguments are the values given to a subroutine when it is used, or called. The values of
the arguments are passed into the subroutine by means of the special variable @_, as
you'll see in the next section.


Other variables a subroutine might use must be protected from interacting with variables 
in other parts of the program, so they have effect only within the subroutine's own scope. 
This is accomplished by declaring them as my variables, as will be explained shortly. 


Finally, most subroutines return their results via the return function. This
can return a single scalar as in return $dna; in our subroutine addACGT, a list of
scalars as in return ($dna1, $dna2);, an array as in return @lines;, and
more.
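The three kinds of return value can be sketched as follows. The subroutine names gc_count, split_in_half, and bases are hypothetical, invented for this illustration.

```perl
use strict;
use warnings;

# Returns a single scalar.
sub gc_count {
    my($dna) = @_;
    my $count = ($dna =~ tr/GCgc//);
    return $count;
}

# Returns a list of two scalars.
sub split_in_half {
    my($dna) = @_;
    my $middle = int( length($dna) / 2 );
    return ( substr($dna, 0, $middle), substr($dna, $middle) );
}

# Returns an array.
sub bases {
    my($dna) = @_;
    my @bases = split //, $dna;
    return @bases;
}

my $gc            = gc_count('ACGGT');          # 3
my($left, $right) = split_in_half('ACGGTT');    # 'ACG' and 'GTT'
my @list          = bases('ACG');               # ('A', 'C', 'G')

print "$gc $left $right @list\n";
```

Notice that the caller collects a list with parentheses on the left side of the assignment, just as the subroutine collects its arguments.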


6.2.1 Arguments 


To call a subroutine means to type its name and give it appropriate arguments and,
usually, collect its results. Arguments, sometimes called parameters, usually
contain the data that the subroutine computes on. In Example 6-1, this is the call of the
subroutine addACGT with the argument $dna:

$longer_dna = addACGT($dna);


The essential point is that whenever you, the programmer, want to use a subroutine, you
can call it with whatever argument(s) it is designed to accept and with which you need to
compute (in this case, whatever DNA needs ACGT appended to it), and the value of
each argument appears in the subroutine in the @_ array.


When you call a subroutine with certain arguments, the names of the arguments you 
provide in the call are not important inside the subroutine. Only the values of those 
arguments that are actually passed inside the subroutine are important. The subroutine 
typically collects the values from the @_ array and assigns them to new variables that 
may or may not have the same names as the variables with which you called the 
subroutine. The only thing preserved is the order of the values, not the names of the 
variables containing the values. 
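A short sketch of this point: the subroutine below never sees the caller's variable names, only the values in the order given. The subroutine describe and its variables are made up for the illustration.

```perl
use strict;
use warnings;

# The subroutine picks its own names for the incoming values.
sub describe {
    my($sequence, $label) = @_;
    return "$label: $sequence";
}

my $dna  = 'ACGT';
my $name = 'fragment1';

my $right_order = describe($dna, $name);   # 'fragment1: ACGT'
my $wrong_order = describe($name, $dna);   # 'ACGT: fragment1'

print "$right_order\n$wrong_order\n";
```

Swapping the arguments in the call swaps what the subroutine treats as the sequence and the label, whatever the variables happen to be called.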


Here's how it works. The first line in the subroutine's block is:

my($dna) = @_;


The values of the arguments from the call of the subroutine are passed into the subroutine
in the special array variable @_. You know it's an array because it starts with the @
character. It has the brief name "_", and it's a special array variable that comes predefined
in Perl programs. (It's not a name you should pick for your own arrays.) The array @_
contains all the scalar values passed into the subroutine. These scalar values are the
values of the arguments to the subroutine. In this case, there is one scalar value: the string
of DNA that's the value of the variable $dna passed in as an argument.


If the subroutine has more arguments (for instance, one argument for DNA, one for the
associated protein, and one for the name of the gene), they are all passed in and assigned
to my variables inside the subroutine:


my($dna, $protein, $name_of_gene) = @_;

If there are no arguments, just omit that statement in the subroutine.

After the statement:


my($dna) = @_;

executes in the subroutine, the passed-in value is assigned to the subroutine's variable
$dna. The next section explains why this is a new variable specific to the subroutine.
The subroutine's variable can be called anything; it certainly doesn't have to be the same
name as the argument, as it happens to be in this example. What's cool about scoping is
that it doesn't matter if it is or not.


Beware the common mistake of forgetting the @_ array when
naming your arguments in a subroutine, that is, using the statement:


my($dna);


instead of:


my($dna) = @_;


If you make this mistake, the values of the arguments won't appear
in your subroutine, even though their names are declared.
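The mistake is easy to demonstrate. In this sketch, the two hypothetical subroutines differ only in whether they assign from @_.

```perl
use strict;
use warnings;

sub forgets_args {
    my($dna);              # declared, but never assigned from @_
    return defined($dna) ? $dna : 'undefined';
}

sub collects_args {
    my($dna) = @_;         # the argument's value arrives here
    return defined($dna) ? $dna : 'undefined';
}

print forgets_args('ACGT'), "\n";    # prints: undefined
print collects_args('ACGT'), "\n";   # prints: ACGT
```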





6.2.2 Scoping 


By keeping all variables a subroutine uses active only within the subroutine, you can
make it safe to call the subroutines from anywhere. You make the variables specific only
to the subroutine by declaring them as my variables. my is a keyword defined in Perl that
limits variables to the block in which they are used (in this case, the block is the
subroutine).[5]


[5] There are different models of scoping; my implements a type called lexical scoping, also
known as static scoping. Another method is available in Perl via the local construct, but you
almost always want to use my.


Hiding variables and making them local to only a restricted part of a program is called
scoping. In Perl, using my variables is known as lexical scoping, and it's an essential part
of modularizing your programs.


You declare that a variable is a my variable like this:

my($x);

or:

my $x;

or, combining the declaration with an initialization to a value:

my($x) = '49';

or, if you're collecting an argument within a subroutine:

my($x) = @_;


Once a variable is declared in this fashion, it exists only until the end of the block it was 
declared in. So in a subroutine, if you declare all your variables like this (both the 
arguments and any other variables), they are active only in the subroutine. If any variable 
has the same name as another variable elsewhere in the program, you don't have to worry, 
because the my declaration actually creates a new variable, active only in the enclosing 
block, and any other variable of the same name used elsewhere outside the block is kept 
separate. 
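The same rule holds for any block, not just a subroutine. In this minimal sketch, a bare pair of curly braces stands in for the subroutine block, and the values are made up.

```perl
use strict;
use warnings;

my $dna = 'AAAAA';           # the "outer" variable

{
    my $dna = 'TTTTT';       # a new variable, visible only inside these braces
    $dna .= 'GG';
    print "Inside the block:  $dna\n";   # TTTTTGG
}

print "Outside the block: $dna\n";       # AAAAA
```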


The example that showed collecting an argument in a subroutine uses parentheses around
the variable. Because @_ is an array, the parentheses around the new variables put them
in array context and ensure that they are initialized correctly (see Chapter 4).
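To see why the parentheses matter, compare what happens with and without them. Both subroutine names are hypothetical.

```perl
use strict;
use warnings;

sub first_with_parens {
    my($x) = @_;     # list assignment: $x gets the first argument
    return $x;
}

sub count_without_parens {
    my $x = @_;      # scalar assignment: $x gets the NUMBER of arguments
    return $x;
}

my $first = first_with_parens('ACGT', 'TTTT');      # 'ACGT'
my $count = count_without_parens('ACGT', 'TTTT');   # 2

print "$first $count\n";
```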


Always declare all your variables in your subroutines (even those
variables that don't come in as arguments) with the my construct.





Why use scoping? Example 6-2 shows the trouble that can happen when you don't. 
Recall that one of the advantages of subroutines is writing a useful bit of code once and 
then using it whenever you need it. Example 6-2 is a program that has a variable in the 
main program with the same name as a variable in a subroutine it calls. This can easily 
happen if you write the subroutine at a time other than the main program (say six months 
later) or if you call a subroutine someone else wrote. 


Example 6-2. The pitfalls of not using my variables

#!/usr/bin/perl -w
# Illustrating the pitfalls of not using my variables

$dna = 'AAAAA';

$result = A_to_T($dna);

print "I changed all the A's in $dna to T's and got $result\n\n";

exit;

################################################################################
# Subroutines
################################################################################

sub A_to_T {
    my($input) = @_;

    $dna = $input;

    $dna =~ s/A/T/g;

    return $dna;
}

Example 6-2 gives the following output:
I changed all the A's in TTTTT to T's and got TTTTT


What was expected was this output:


I changed all the A's in AAAAA to T's and got TTTTT

You can get this expected output by changing the definition of subroutine A_to_T to
the following, in which the variable $dna in the subroutine is declared as a my variable:

sub A_to_T {
    my($input) = @_;

    my($dna) = $input;

    $dna =~ s/A/T/g;

    return $dna;
}


Where exactly did Example 6-2 go wrong? When the program entered the subroutine,
and used the variable $dna to calculate the string with A's changed to T's, the Perl
language saw that there was already a variable $dna being used in the main part of the
program and just kept using it. When the program returned from the subroutine and got to
the print statement, it was still using the same (the one and only) variable $dna. So,
when it printed the results, the variable $dna, instead of having the original DNA in it,
had the altered DNA that had been computed in the subroutine.


Now this sort of thing can happen a lot. Programmers tend to use certain names for
variables a great deal: the usual suspects are names such as $tmp, $temp, $x, $a,
$number, $variable, $var, $array, $input, $output, $result, $data,
$file, $filename, and so on. Bioinformaticians are quite fond of $dna, $protein,
$motif, $sequence, and the like. As you start using libraries of subroutines from
other people and as your programs get larger, it's much easier, and a whole lot safer, to
let the Perl language worry about avoiding the problem of name collisions.


In fact, from now on we're going to stop using undeclared variables.
From this point forward, all our variables, even those in the main
program, will be declared with my. You can enforce this discipline by adding the
following directive to your programs:


use strict;


which has the effect of insisting that your programs have all their variables declared as
my variables.


Lest you rail at this seemingly unnecessary complication to your coding, compared to the 
simpler and happier days of Chapter 4 and Chapter 5, you should know that many
languages require declarations for all their variables. The fact that in Perl you don't have 
to enforce strict scoping is handy when you're writing short programs, for example, or 
when you're trying to teach programming without hitting the students with a thousand 
details at the beginning. 


Another benefit you get from strict scoping happens if you accidentally misspell a variable
name while writing a program. If the variables aren't being declared, Perl creates a new
variable with the (misspelled) name. The program may not work correctly, and it may be
hard to find where the problem is. By strictly scoping the program, any misspelled
variables are also undeclared, and Perl complains about it, saving you hours or days of
hair-pulling and bad language.
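Here is a sketch of what that complaint looks like. Because the error occurs before the program runs, the misspelled code is kept in a string and compiled with eval, purely so this sketch can capture the message and keep going; in an ordinary program Perl would simply refuse to start. The variable names are made up.

```perl
use strict;
use warnings;

# Code with a deliberate misspelling, held in a string so that the
# compile-time error can be captured instead of stopping this sketch.
my $code = q{
    use strict;
    my $sequence = 'ACGT';
    $sequnce = 'TTTT';    # misspelled on purpose
    1;
};

my $ok    = eval $code;   # undefined, because the code fails to compile
my $error = $@;           # Perl's complaint about the undeclared variable

print "strict refused the code:\n$error" unless $ok;
```

The message names the offending variable, which usually points straight at the typo.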


Finally, let's recap how scoping, arguments, and subroutines work by taking another look
at Example 6-1. The subroutine is called by writing its name addACGT, passing it
the argument $dna, and collecting results (if any) by assignment to $longer_dna:

$longer_dna = addACGT($dna);

The first line in the subroutine gets the value of the argument from the special variable
@_, and stores it in its own variable called $dna, which can't be seen outside the
subroutine because it uses my. Even though the original variable outside the subroutine is
also called $dna, the variable called $dna within the subroutine is an entirely new
variable (with the same name) that belongs only to the subroutine due to the use of my.
This new variable is in effect only during the time the program is in the subroutine.
Notice in the output from the print statement at the end of Example 6-1 that even
though a variable called $dna is lengthened inside the subroutine, the original variable,
$dna, outside the subroutine isn't changed.


6.3 Command-Line Arguments and Arrays 


Example 6-3 is another program that uses subroutines. You use the command line to 
give the program information it needs (such as filenames, or strings of DNA) without 
having to interactively answer the program's prompts. This is useful if you're scheduling 
a program to run at a time when you won't be there, for instance. 


Example 6-3 also shows a little more about using arrays. You'll see how to use 
subscripts to access a specific element of an array. 


For command-line programs, you type the name of the program, followed by the 
arguments to the program, if any, and then hit the Enter (or Return) key to start the 
program running. In Example 6-3, when the user types the program name, she follows 
that with the argument, which, in this case, is just the string of DNA in which she'll count 
the G's. So the program is called and returns an answer like so: 


AAGGGGTTTCCC 
The DNA AAGGGGTTTCCC has 4 G's in it! 


Of course, many programs come with a graphical user interface (GUI). This gives the 
program some or all of the computer screen and usually includes such things as menus, 
buttons, and places to type in values to set parameters from the keyboard. 


However, many programs are run from a command line. Even the newer MacOS X, 
which is built on top of Unix, now provides a command line. (Although most Windows 
users don't use the MS-DOS command window much, it's still useful, e.g., for running 
Perl programs.) As already mentioned, running a program noninteractively, passing 
parameters in as command-line arguments, allows you to run the program automatically, 
say in the middle of the night when no one is actually sitting at the computer. 


Example 6-3 counts the number of G's in a string of DNA. 

Example 6-3. Counting the G's in some DNA on the command line 
#!/usr/bin/perl -w
# Counting the number of G's in some DNA on the command line

use strict;

# Collect the DNA from the arguments on the command line
# when the user calls the program.
# If no arguments are given, print a USAGE statement and exit.

# $0 is a special variable that has the name of the program.
my($USAGE) = "$0 DNA\n\n";

# @ARGV is an array containing all command-line arguments.
#
# If it is empty, the test will fail and the print USAGE and exit
# statements will be called.
unless (@ARGV) {
    print $USAGE;
    exit;
}

# Read in the DNA from the argument on the command line.
my($dna) = $ARGV[0];

# Call the subroutine that does the real work, and collect the result.
my($num_of_Gs) = countG( $dna );

# Report the result and exit.
print "\nThe DNA $dna has $num_of_Gs G\'s in it!\n\n";

exit;

################################################################################
# Subroutines for Example 6-3
################################################################################

sub countG {
    # return a count of the number of G's in the argument $dna

    # initialize arguments and variables
    my($dna) = @_;

    my($count) = 0;

    # Use the fourth method of counting nucleotides in DNA, as shown in
    # Chapter Four, "Motifs and Loops"
    $count = ( $dna =~ tr/Gg//);

    return $count;
}


Now let's look at how this program works, while examining and explaining the new 
features. For starters, notice the new line: 


use strict; 


which I will use from now on to ensure all variables are declared with my, thus enforcing 
lexical scoping. 


Perl has some special variables it sets so you can easily use the arguments from the 
command line. Every Perl program has an array variable @ARGV that contains any 
command-line arguments. Also, there's a special variable called $0 (a zero) that has the 
name of the program as it was called from the command line. 


Notice in Example 6-3 that an informative message is defined in the variable $USAGE
and that it begins with the value of the variable $0, followed by an indication of the
arguments the program needs. This is a common practice; if the user doesn't give the
program what it needs, which is determined by some kind of test, the program prints
information about how to properly use it and exits.


In fact, this program does check to see if any arguments were typed on the command line.
It checks whether @ARGV has anything in it, in which case it evaluates to true, or whether
it is completely empty, in which case it evaluates to false. Because the program
requires an argument, it uses the unless conditional: if @ARGV is
empty, it prints the $USAGE statement and exits:

unless (@ARGV) {
    print $USAGE;
    exit;
}

The next bit of code shows something new about arrays, namely, how to extract one
element from an array, as referenced by a subscript. In other words, it shows how to get
at the first, fourth, or whichever element. The code in Example 6-3 shows how to
extract the first element, which as you've seen, is numbered 0:

my($dna) = $ARGV[0];


Now you already know there is a first element, since you've just tested to make sure the
array isn't empty. You get the first element of array @ARGV by changing the @ to a $ and
appending square brackets containing the desired subscript; 0 for the first element, 1 for
the second element, and so on. This syntax indicates that since you're now looking at just
one element of the array, and it's a scalar variable, you use the dollar sign, as you would
any other scalar variables.

In Example 6-3, you copy this first (and only) element of the command-line array
@ARGV into the variable $dna.
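
The subscript syntax can be tried out in a short sketch. The array @arguments is a hypothetical stand-in for @ARGV, filled with arbitrary strings so that the code runs without any command-line arguments.

```perl
use strict;
use warnings;

# A stand-in for @ARGV so the sketch runs without command-line arguments.
my @arguments = ('AAGGGGTTTCCC', 'ACGT', 'TTTT');

my $first  = $arguments[0];         # first element: subscript 0
my $second = $arguments[1];         # second element: subscript 1
my $count  = scalar(@arguments);    # number of elements in the array

print "First argument:  $first\n";
print "Second argument: $second\n";
print "Number of arguments: $count\n";
```

Each subscripted element is written with a dollar sign because it is a single scalar value, even though the array as a whole is written with an @.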


Finally comes the call to the subroutine, which contains nothing new 
but fulfills a dream from the final paragraph of Chapter 5: 


my($num_of_Gs) = countG( $dna );


6.4 Passing Data to Subroutines 


When you start parsing GenBank, PDB, and BLAST files in later chapters, you'll need 
more complicated arguments to your subroutines to hold the several fields of data you'll 
parse out of the records. These next sections explain the way it's done in Perl. You can 
skim this section and return for a closer read when you get to Chapter 10. 


6.4.1 Subroutines: Pass by Value 

So far, all our subroutines have had fairly simple arguments. The values of these 
arguments are copied and passed to the subroutines, and whatever happens to those 
values in the subroutine doesn't affect the values of the arguments in the main program. 
This is called pass by value or call by value. For example: 

#!/usr/bin/perl -w
# Example of pass-by-value (a.k.a. call-by-value)

use strict;

my $i = 2;

simple_sub($i);

print "In main program, after the subroutine call, \$i equals $i\n\n";

exit;

################################################################################
# Subroutines
################################################################################

sub simple_sub {
    my($i) = @_;

    $i += 100;

    print "In subroutine simple_sub, \$i equals $i\n\n";
}

This gives the following output:

In subroutine simple_sub, $i equals 102

In main program, after the subroutine call, $i equals 2


6.4.2 Subroutines: Pass by Reference


If you have more complicated arguments, say a mixture of scalars, arrays, and hashes,
Perl often cannot distinguish between them. Perl passes all arguments into the subroutine
as a single array, the special @_ array. If there are arrays or hashes as arguments, their
elements get "flattened" out into this single @_ array in the subroutine. Here's an example:

#!/usr/bin/perl -w
# Example of problem of pass-by-value with two arrays

use strict;

my @i = ('1', '2', '3');
my @j = ('a', 'b', 'c');

print "In main program before calling subroutine: i = " . "@i\n";
print "In main program before calling subroutine: j = " . "@j\n";

reference_sub(@i, @j);

print "In main program after calling subroutine: i = " . "@i\n";
print "In main program after calling subroutine: j = " . "@j\n";

exit;

################################################################################
# Subroutines
################################################################################

sub reference_sub {
    my(@i, @j) = @_;

    print "In subroutine : i = " . "@i\n";
    print "In subroutine : j = " . "@j\n";

    push(@i, '4');
    shift(@j);
}


The following output illustrates the problem of this approach: 


In main program before calling subroutine: i = 1 2 3
In main program before calling subroutine: j = a b c
In subroutine : i = 1 2 3 a b c
In subroutine : j =
In main program after calling subroutine: i = 1 2 3
In main program after calling subroutine: j = a b c

As you see, in the subroutine all the elements of @i and @j were grouped into one @_ 
array. All distinction between the two arrays you started with was lost in the subroutine. 
When you try to get the two arrays back in the statement:

my(@i, @j) = @_;

Perl assigns everything to the first array, @i. This behavior makes passing multiple arrays
into subroutines somewhat dicey.


Also, as usual, the original arrays in the main program were not affected by the 
subroutine, since you used lexical scoping (my variables). 


To get around this problem, you can pass arguments into subroutines 
in a style called pass by reference or call by reference. Using pass by 
reference, you can pass a subroutine any collection of scalars, arrays, hashes, and more, 
and the subroutine can distinguish between them. There is a price to pay: the resulting 
code looks a little more complex. But the payoff is often well worth it. 


There is one big difference in the behavior of arguments that are passed by reference. 
When argument variables are passed in this fashion, anything you do to the values of the 
argument variables in the subroutine also affects the values of the arguments in the main 
program. 


To call a subroutine that has its arguments passed by reference, you call it the same way 
as before, with one difference: you must preface the argument names with a backslash. In 
the example of pass by reference in this section, the subroutine call is accomplished like 
so:

reference_sub(\@i, \@j);


As you see here, the arguments are two arrays, and, to preserve the distinction between
them as they are passed into the reference_sub subroutine, they are passed by reference
by prepending their names with a backslash.

Within the subroutine, there are a few changes. First, the arguments are collected from
the @_ array, and saved as scalar variables. This is because a reference is a special kind of
data that is stored in a scalar variable, no matter whether it's a reference to a scalar, an
array, a hash, or other. The example collects its arguments as follows:

my($i, $j) = @_;

reading them from the @_ array as scalars.


The subroutine has to do one more thing with these referenced arguments. When it uses
them, it has to dereference them. To dereference a referenced argument, you have to
prepend the reference with the symbol that shows what kind of variable it is: a $ for a
scalar, @ for an array, % for a hash. So these variables have two symbols before their
name: reading left to right, their usual symbol and then a $ that indicates the variable is
a reference. The lines:

push(@$i, '4');
shift(@$j);

in the following subroutine are the ones that manipulate the arguments. The push adds
an element '4' to the end of the @i array, and the shift removes the first element from the
@j array. Because these arrays have been passed by reference, their names in the
subroutine are @$i and @$j. (If you want to look at the third element of the @j array,
which normally is $j[2], you'd say $$j[2].)
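
The dereferencing notation is easier to absorb outside a subroutine. The following is a small sketch under the assumption that a reference held in a scalar behaves the same wherever it is used; the variable name $j_ref is hypothetical and chosen only to keep the reference apart from the array it points to.

```perl
use strict;
use warnings;

my @j = ('a', 'b', 'c');

# A backslash in front of the array name produces a reference,
# which is stored in an ordinary scalar variable.
my $j_ref = \@j;

my @copy  = @$j_ref;            # the whole array, dereferenced with @
my $third = $$j_ref[2];         # one element, dereferenced with $
my $size  = scalar(@$j_ref);    # number of elements in the referenced array

print "All elements: @copy\n";
print "Third element: $third\n";
print "Number of elements: $size\n";
```

Reading left to right, the first symbol says what you want back (an array or a single scalar) and the $ that follows marks the variable as a reference.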


Whatever changes you make to the arguments in the subroutine also take effect in the 
main program. This is because the references are references to the actual arguments; they 
are not copies of their values as in pass by value. So, as you see in the example, after 
calling the subroutine, the arrays in the main program have been altered accordingly: 


#!/usr/bin/perl
# Example of pass-by-reference (a.k.a. call-by-reference)

use strict;
use warnings;

my @i = ('1', '2', '3');
my @j = ('a', 'b', 'c');

print "In main program before calling subroutine: i = " . "@i\n";
print "In main program before calling subroutine: j = " . "@j\n";

reference_sub(\@i, \@j);

print "In main program after calling subroutine: i = " . "@i\n";
print "In main program after calling subroutine: j = " . "@j\n";

exit;

################################################################################
# Subroutines
################################################################################

sub reference_sub {
    my($i, $j) = @_;

    print "In subroutine : i = " . "@$i\n";
    print "In subroutine : j = " . "@$j\n";

    push(@$i, '4');
    shift(@$j);
}


This gives the following output:

In main program before calling subroutine: i = 1 2 3
In main program before calling subroutine: j = a b c
In subroutine : i = 1 2 3
In subroutine : j = a b c
In main program after calling subroutine: i = 1 2 3 4
In main program after calling subroutine: j = b c

The subroutine can now distinguish between the two arrays passed in as arguments. The
changes that were made inside the subroutine to the variables remain in effect after the
subroutine has ended, and you've returned to the main program. This is the essential
characteristic of pass by reference.
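
The same approach extends to a mixture of argument types. The sketch below passes one plain scalar, one array reference, and one hash reference; the subroutine name tally and the data are hypothetical, made up for this illustration rather than taken from the book's examples.

```perl
use strict;
use warnings;

# Hypothetical data for this sketch.
my @bases = ('A', 'C', 'G', 'T');
my %counts;

tally('GATTACA', \@bases, \%counts);

foreach my $base (@bases) {
    print "$base: $counts{$base}\n";
}

sub tally {
    # One plain scalar, then two references, each held in a scalar.
    my($dna, $bases, $counts) = @_;

    foreach my $base (@$bases) {
        # Collect every occurrence of the base, then count the matches.
        my @matches = ($dna =~ /$base/g);

        # Storing through the reference changes %counts in the main program.
        $$counts{$base} = scalar(@matches);
    }
}
```

Because the hash is passed by reference, the subroutine needs no return value: the counts are already in %counts when control returns to the main program.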


6.5 Modules and Libraries of Subroutines 


As you start to build a collection of subroutines, you'll find that you're copying them a lot
from existing programs and pasting them into new programs. The subroutines then
appear in multiple programs. This makes the listings of your program code a bit verbose
and repetitive. It also makes modifying a subroutine more complicated because you have
to modify all the copies.


In short, subroutines are great, but if you have to keep copying them into each new
program you write, it gets tiresome. So it's time to start collecting subroutines into the
handy files called modules or libraries.

Here's how it works. You put all your reusable subroutines into a separate file. (Or, as
you keep writing more and more code, and things get complicated, you may want to
organize them into several files.) Then you just name the file in your program and presto:
the subroutine definitions all get read in, just as if they were in your program. To do this,
you use the Perl built-in function use, which reads in the subroutine library file.


Let's call this module BeginPerlBioinfo.pm. You can put all your subroutine definitions
into it, just as they appear in the program code. You can create the module by typing
in the subroutine definitions as you read the book; or, more easily, you can download it
from the book's web site. But there is one thing to remember when creating or adding to a
module: the last line in a module must be 1; or it won't work. This 1; should be the last
line of the .pm file, not part of the last subroutine. If you forget this, you'll get an error
message something like:

BeginPerlBioinfo.pm did not return a true value at jkl line 14.
BEGIN failed--compilation aborted at jkl line 14.

Now, to use any of the subroutines in BeginPerlBioinfo.pm, you just have to put the
following statement in your code, near the top (near the use strict statement):

use BeginPerlBioinfo;

Note that .pm is left off the name on purpose: that's how Perl handles the names of
modules.


There's one last thing to know about using modules to load in subroutines: the Perl
program needs to know where to find the module. If you're doing all your work in one
folder, everything should work okay. If Perl complains about not being able to find
BeginPerlBioinfo.pm, tell it which folder the module is in. If the full pathname
is /home/tisdall/book/BeginPerlBioinfo.pm, then use this in your program:

use lib '/home/tisdall/book';
use BeginPerlBioinfo;

There are other ways to tell Perl where to look for modules; consult the Perl
documentation for use.
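
To see the whole arrangement in one place, here is a self-contained sketch that writes a tiny library file into a temporary folder and then loads it. Everything named here is an assumption made for illustration: the file TinyBioLib.pm and its subroutine count_G are hypothetical, and the sketch uses require and @INC at run time where a real program would normally use the use lib and use statements shown above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Write a tiny library file into a temporary folder.
# The name TinyBioLib.pm and its one subroutine are hypothetical.
my $folder = tempdir(CLEANUP => 1);
my $file   = "$folder/TinyBioLib.pm";

open(my $fh, '>', $file) or die "Cannot write $file: $!";
print $fh <<'END_OF_MODULE';
sub count_G {
    my($dna) = @_;
    my $count = ($dna =~ tr/Gg//);
    return $count;
}

1;
END_OF_MODULE
close($fh);

# Tell Perl where to look, then load the file.
# require loads the file while the program runs; use does so at compile time.
unshift(@INC, $folder);
require TinyBioLib;

my $number_of_Gs = count_G('AAGGGGTTTCCC');
print "Found $number_of_Gs G's\n";
```

Note the 1; on its own line at the end of the library text, after the closing brace of the last subroutine. Without it the require would fail with a "did not return a true value" error.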


Beginning in Chapter 8, I'll define subroutines and show the code, but you'll be putting
them into your module and typing:

use BeginPerlBioinfo;

This module is also available for download at this book's web site.


6.6 Fixing Bugs in Your Code 


Now let's talk about what to do when your program is having trouble.

A program can go wrong in any number of ways. Maybe it won't run at all. A look at the
error messages, especially the first line or two of the error messages, usually leads you to
the problem, which will be somewhere in the syntax, and its solution, which will be to
use the correct syntax (e.g., matching braces or ending each statement with a semicolon).


Your program may run but not behave as you planned. Then you have some problem 
with the logic of the program. Perhaps at some point, you've zigged when you should 
have zagged, like adding instead of subtracting or using the assignment operator = when 
you meant to test for equality between two numbers with ==. Or, the problem could be 
that you just have a poor design to accomplish your task, and it's only when you actually 
try it out that the flaw becomes evident. 
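
The assignment-versus-equality mistake is worth seeing once in code. This is a minimal sketch; the subroutine name is_five is hypothetical, and the buggy form is left in a comment so that the block itself runs cleanly.

```perl
use strict;
use warnings;

# Return 1 if the argument equals 5, and 0 otherwise.
sub is_five {
    my($number) = @_;

    # Testing for equality between two numbers uses ==.
    if ($number == 5) {
        return 1;
    }
    return 0;

    # The buggy form would be:
    #
    #     if ($number = 5) { ... }
    #
    # That assigns 5 to $number and then treats the 5 as true,
    # so the test succeeds whatever value was passed in.
}

print "is_five(5) gives ", is_five(5), "\n";
print "is_five(3) gives ", is_five(3), "\n";
```

With warnings turned on, Perl flags the buggy form with the message "Found = in conditional, should be ==", which is one more reason to keep warnings enabled.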


However, sometimes the problem is not obvious, and you have to resort to the heavy 
artillery. 


Fortunately, Perl has several ways to help you find and fix bugs in your programs. The 
use of the statements use strict; and use warnings; should become a habit, as 
you can catch many errors with them. The Perl debugger gives you complete freedom to 
examine a program in detail as it runs. 


6.6.1 use warnings; and use strict; 


In general, it's not too hard to tell when the syntax of a program is wrong because the Perl 
interpreter will produce error messages that usually lead you right to the problem. It's 
much harder to tell when the program is doing something you didn't really want. Many 
such problems can be caught if you turn on the warnings and enforce the strict use of 
declarations. 


You have probably noticed that all the programs in this book up until now start with the 
command interpreter line: 


#!/usr/bin/perl -w 

That -w turns on Perl's warnings, which attempt to find potential problems in your code
and warn you about them. They catch common problems such as variables that are
declared more than once: things that are not syntax errors but that can lead to
bugs.


Another way to turn on warnings is to add the following statement near the top of the 
program: 


use warnings; 

The statement use warnings; may not be available on your version of Perl, if it's an 
old one. So if your Perl complains about it, take it out and use the -w command instead, 
either on the command interpreter line, or from the command line: 

$ perl -w my_program

However, use warnings; is a bit more portable between different operating systems. So,
from now on, that's the way I'll turn on warnings in my code. Another important helper
you should use is the following statement placed near the top of your program (next to
use warnings;):

use strict;

As mentioned previously, this forces you to declare your variables. (It has some options
that are beyond the scope of this book.) It finds misspelled variables, undeclared
variables that may be interfering with other parts of the program, and so on.
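
To watch use strict; catch a misspelling without crashing the demonstration, the sketch below keeps a faulty three-line program in a string and compiles it with eval. The variable names are hypothetical, and the exact wording of Perl's complaint varies between Perl versions, so the code prints whatever message it receives rather than assuming the text.

```perl
use strict;
use warnings;

# A small program held in a string, so this sketch can compile it with
# eval and report the complaint instead of stopping.
my $program = q{
    use strict;
    my $sequence = 'ACGT';
    $sequnce = $sequence . 'ACGT';    # misspelled on purpose
    1;
};

my $compiled  = eval $program;
my $complaint = $@;    # save the error message right away

if ($compiled) {
    print "No complaints from Perl.\n";
} else {
    print "Perl complained:\n$complaint";
}
```

Without use strict;, the misspelled name would quietly become a new variable and the program would carry on with the wrong data.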


It's best to always use both use strict; and use warnings;
when writing your Perl code.





6.6.2 Fixing Bugs with Comments and Print Statements 


Sometimes you can identify misbehaving code by selectively commenting out sections of 
the program until you find the part that seems to cause the problem. You can also add 
print statements at suspicious parts of a misbehaving program to check what certain 
variables are doing. Both of these are time-honored programming techniques, and they 
work well in almost any programming language. 
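
As a small illustration of the print-statement technique, the sketch below traces a loop as it runs. The data and variable names are hypothetical, and sending the trace to STDERR is a design choice made here so that debugging output stays separate from the program's real output.

```perl
use strict;
use warnings;

my $dna = 'ACGTT';
my $count_of_T = 0;

foreach my $base (split('', $dna)) {
    if ($base eq 'T') {
        $count_of_T++;
    }

    # Temporary debugging line; remove it once the problem is found.
    print STDERR "DEBUG: base=$base count_of_T=$count_of_T\n";
}

print "Number of T's: $count_of_T\n";
```

Commenting out the DEBUG line with a leading # switches the trace off again without deleting it.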


Commenting out sections of code can be particularly helpful when the error messages 
that you get from Perl don't point you directly at the offending line. This happens 
occasionally. When it does happen you may, by trial and error, discover that commenting 
out a small section of code causes the error messages to go away; then you know where 
the error is occurring. 


Adding print statements can also be a quick way to pinpoint a problem, especially if 
you already have some idea of where the problem is. As a novice programmer, however, 
you may find that using the Perl debugger is easier than adding print statements. In the 
debugger, you can easily set print statements at any line. For instance, the following
debugger command says to print the values of $i and $k before line 48:

a 48 print "$i $k\n"

Once you learn how to do it, this method is generally faster and easier than editing the 
Perl program and adding print statements by hand. Using this method is partly a matter 
of taste, since some extremely good Perl programmers prefer to do it the old-fashioned 
way, by adding print statements. 


6.6.3 The Perl Debugger 


My favorite way to deal with nonobvious bugs in my programs is to use the Perl 
debugger. The problem with bugs in code is that once a program starts running, all you 
can see is the output; you can't see the steps a program is taking. The Perl debugger lets 
you examine your program in detail, step by step, and almost always can lead you 
quickly to the problem. You'll also find that it's easy to use with a little practice.

There are situations the Perl debugger can't handle well: interacting processes that depend
on timing considerations, for instance. The debugger can examine only one program at a 
time, and while examining, it stops the program, so timing considerations with other 
processes go right out the window. 


For most purposes, the Perl debugger is a great, essential, programming tool. This section 
introduces its most important features. 


6.6.3.1 A program with bugs 


Example 6-4 has some bugs we can examine. It's supposed to take a sequence and two 
bases, and output everything from those two bases to the end of the sequence (if it can 
find them in the sequence). The two bases can be given as an argument, or if no argument 
is given, the program uses the bases TA by default. 


There is one new thing in Example 6-4. The next statement affects the control flow in 
a loop. It immediately returns the control flow to the next iteration of the loop, skipping 
whatever else would have followed. Also, you may want to recall $_, which we
discussed back in Example 5-5 in the context of a foreach loop.
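
Before reading the buggy program, it may help to see next and $_ working together in a loop that behaves correctly. This is a minimal sketch with made-up data: the list of bases, including the N for an unknown base, is hypothetical.

```perl
use strict;
use warnings;

# An arbitrary list of bases, including one N for an unknown base.
my @bases = ('A', 'C', 'N', 'G', 'T');
my $kept = '';

foreach (@bases) {
    # $_ holds the current element of the list.
    if ($_ eq 'N') {
        next;    # skip the rest of the block and go on to the next base
    }
    $kept .= $_;
}

print "Kept bases: $kept\n";
```

When the loop reaches N, the next statement jumps straight to the following element, so the line that appends to $kept never runs for that base.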


Example 6-4. A program with a bug or two 


#!/usr/bin/perl
# A program with a bug or two
#
# An optional argument, for where to start printing the sequence,
# is a two-base subsequence.
#
# Print everything from the subsequence ( or TA if no subsequence
# is given as an argument) to the end of the DNA.

# declare and initialize variables
my $dna = 'CGACGTCTTCTAAGGCGA';
my @dna;
my $receivingcommittment;
my $previousbase = '';

my $subsequence = '';

if (@ARGV) {
  my $subsequence = $ARGV[0];
}else{
  $subsequence = 'TA';
}

my $base1 = substr($subsequence, 0, 1);
my $base2 = substr($subsequence, 1, 1);

# explode DNA
@dna = split ( '', $dna );

######### Pseudocode of the following loop:
#
# If you've received a committment, print the base and continue. Otherwise:
#
# If the previous base was $base1, and this base is $base2, print them.
# You have now received a committment to print the rest of the string.
#
# At each loop, save the previous base.

foreach (@dna) {
  if ($receivingcommittment) {
    print;
    next;
  } elsif ($previousbase eq $base1) {
    if ( /$base2/ ) {
      print $base1, $base2;
      $recievingcommitment = 1;
    }
  }
  $previousbase = $_;
}

print "\n";

exit;

Here's the output of two runs of Example 6-4:

$ perl example6-4 AA

$ perl example6-4
TA

Huh? It should have printed out AAGGCGA when called with the argument AA, and 
TAAGGCGA when called with no arguments. There must be a bug in this program. But, if 
you look it over, there isn't anything obviously wrong. It's time to fire up the debugger. 
What follows is an actual debugging session on Example 6-4, interspersed with 
comments to explain what's happening and why. 


6.6.3.2 How to start and stop the debugger 




The debugger runs interactively,[6] and you control it from the keyboard. The most
common way to start it is by giving the -d switch to Perl at the command line. Since
you're using buggy Example 6-4 to demonstrate the debugger, here's how to start that
program:

perl -d example6-4

[6] You also can run it automatically to produce a trace of the program in a file.


Alternatively, you could have added a -d flag to the command interpreter: 
#!/usr/bin/perl -d 


On systems such as Unix and Linux where command interpretation works, this starts the 
debugger automatically. 


To stop the debugger, simply type q.


6.6.3.3 Debugger command summary


First, let's try to find the bug in Example 6-4 when it's called with no arguments:


$ perl -d example6-4
Default die handler restored.

Loading DB routines from perl5db.pl version 1.07
Editor support available.

Enter h or `h h' for help, or `man perldebug' for more help.

main::(example6-4:11):  my $dna = 'CGACGTCTTCTAAGGCGA';
DB<1>

Let's stop right here at the beginning and look at a few things. After some messages,
which may not mean a whole lot right now, you get the excellent information that the
commands h and h h give more help. Let's try h h:





DB<1> h h
List/search source lines:               Control script execution:
  l [ln|sub]  List source code            T           Stack trace
  - or .      List previous/current line  s [expr]    Single step [in expr]
  w [line]    List around line            n [expr]    Next, steps over subs
  f filename  View source in file         <CR/Enter>  Repeat last n or s
  /pattern/ ?patt?   Search forw/backw    r           Return from subroutine
  v           Show versions of modules    c [ln|sub]  Continue until position
Debugger controls:                        L           List break/watch/actions
  O [...]     Set debugger options        t [expr]    Toggle trace [trace expr]
  <[<]|{[{]|>[>] [cmd] Do pre/post-prompt b [ln|event|sub] [cnd] Set breakpoint
  ! [N|pat]   Redo a previous command     d [ln] or D Delete a/all breakpoints
  H [-num]    Display last num commands   a [ln] cmd  Do cmd before line
  = [a val]   Define/list an alias        W expr      Add a watch expression
  h [db_cmd]  Get help on command         A or W      Delete all actions/watch
  |[|]db_cmd  Send output to pager        ![!] syscmd Run cmd in a subprocess
  q or ^D     Quit                        R           Attempt a restart
Data Examination:     expr     Execute perl code, also see: s,n,t expr
  x|m expr       Evals expr in list context, dumps the result or lists methods.
  p expr         Print expression (uses script's current package).
  S [[!]pat]     List subroutine names [not] matching pattern
  V [Pk [Vars]]  List Variables in Package.  Vars can be ~pattern or !pattern.
  X [Vars]       Same as "V current_package [Vars]".
For more help, type h cmd_letter, or run man perldebug for all docs.
DB<2>

It's a bit hard to read, but you have a concise summary of the debugger commands. You
can also use the h command, which gives several screens' worth of information. The |h
command displays those several pages one at a time; the pipe at the beginning of a
debugger command pipes the output through a pager, which typically advances a page
when you hit the spacebar on your keyboard. You should try those out. Right now,
however, let's focus on a few of the most useful commands. But remember that typing h
command can give you help about the command.


6.6.3.4 Stepping through statements with the debugger 


Back to the immediate problem. When you started up the debugger, you saw that it 
stopped on the first line of real Perl code: 


main::(example6-4:11):  my $dna = 'CGACGTCTTCTAAGGCGA';

There's an important point about the debugger you should understand right away. It
shows the line it's about to execute, not the line it just executed.

So really, Example 6-4 hasn't done anything yet. You can see from the command
summary that p tells the debugger to print out values. If you ask it to print the value of
$dna, you'll find:

DB<2> p $dna

DB<3>

It didn't show anything because there's nothing to show; it hasn't even seen the variable
$dna yet. So you should execute the statement. There are two commands to use: n and s
both execute the statement being displayed. (The difference is that n or "next" skips the
plunge into a subroutine call, treating it like a single statement; s or "single step" enters a
subroutine and single-steps you through that code as well.) Once you've given one of these
commands, you can just hit Enter to repeat the same command.


Since there aren't any subroutines, you needn't worry about choosing between n and s, so 
let's use n: 


DB<3> n
main::(example6-4:12):  my @dna;
DB<3>

This shows the next line (you can see the line numbers of the Perl program at the end of
the prompt). If you wish to see more lines, the w or "window" command will serve:

DB<3> w
9
10      # declare and initialize variables
11:     my $dna = 'CGACGTCTTCTAAGGCGA';
12==>   my @dna;
13:     my $receivingcommittment;
14:     my $previousbase = '';
15
16:     my $subsequence = '';
17
18:     if (@ARGV) {
DB<3>


The current line—the line that will be executed next—is highlighted with an arrow (==>). 


The w seems like a useful thing. Let's get more information about it with the help 
command h w: 


DB<3> h w 
w [line] List window around line. 
DB<4> 










Actually, there's more—hitting w repeatedly keeps showing more of the program; a 
minus sign backs up a screen. But enough of that. 


Now that $dna has been declared and initialized, let's check whether anything has gone
wrong in that first statement:

DB<4> p $dna
CGACGTCTTCTAAGGCGA
DB<5>





That's exactly what was expected. There's no bug, so let's continue examining the lines,
printing out values here and there:

DB<5> n
main::(example6-4:13):  my $receivingcommittment;
DB<5> n
main::(example6-4:14):  my $previousbase = '';
DB<5> n
main::(example6-4:16):  my $subsequence = '';
DB<5> n
main::(example6-4:18):  if (@ARGV) {
DB<5> p @ARGV

DB<6> w
15
16:     my $subsequence = '';
17
18==>   if (@ARGV) {
19:       my $subsequence = $ARGV[0];
20      }else{
21:       $subsequence = 'TA';
22      }
23
24:     my $base1 = substr($subsequence, 0, 1);
DB<6> n
main::(example6-4:21):    $subsequence = 'TA';
DB<6> n
main::(example6-4:24):  my $base1 = substr($subsequence, 0, 1);
DB<6> p $subsequence
TA
DB<7> n
main::(example6-4:25):  my $base2 = substr($subsequence, 1, 1);
DB<7> n
main::(example6-4:28):  @dna = split ( '', $dna );
DB<7> p $base1
T
DB<8> p $base2
A
DB<9>

So far, everything is as expected; the default subsequence TA is being used, and the
$base1 and $base2 variables are set to T and A, the first and second bases of the
subsequence. Let's continue:

DB<9> n
main::(example6-4:39):  foreach (@dna) {
DB<9> p @dna
CGACGTCTTCTAAGGCGA
DB<10> p "@dna"
C G A C G T C T T C T A A G G C G A
DB<11>

This shows a trick with Perl and printing arrays: normally they are printed without any 
spacing between the elements, but enclosing an array in double quotes in a print 
statement causes it to be displayed with spaces between the elements. 
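
The same trick works outside the debugger. Here is a minimal sketch, assuming an ordinary script with Perl's default settings; the @bases array and the two helper variables are made up for this illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A short array of bases, like the @dna array in Example 6-4
my @bases = ('C', 'G', 'A', 'C');

# Unquoted: print receives a list and writes the elements back to back
# (as long as the special variables $, and $\ keep their default values).
my $unquoted = join('', @bases);
print @bases, "\n";

# Quoted: interpolation joins the elements with the list separator,
# which is a single space by default.
my $quoted = "@bases";
print "@bases\n";
```

The first print shows CGAC and the second shows C G A C. The two scalar variables simply capture the same strings so they can be inspected later.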








Again, everything seems okay, and we're about to enter a loop. Let's look at the whole 
loop first: 








DB<11> w
36      #
37      # At each loop, save the previous base.
38
39==>   foreach (@dna) {
40:       if ($receivingcommittment) {
41:         print;
42:         next;
43        } elsif ($previousbase eq $base1) {
44:         if ( /$base2/ ) {
45:           print $base1, $base2;
DB<11> w
43        } elsif ($previousbase eq $base1) {
44:         if ( /$base2/ ) {
45:           print $base1, $base2;
46:           $recievingcommitment = 1;
47          }
48        }
49:       $previousbase = $_;
50      }
51
52:     print "\n";
DB<11>

Despite the few repeated lines resulting from the w command, you can see the whole loop.
Now you know something in here is going wrong: when you tested the program without
giving it an argument, as it's running now, it took the default argument TA, and so far it
seemed okay. However, all it actually did in your test was to print out the TA when it was
supposed to print out everything in the string starting with the first occurrence of TA.
What's going wrong?


6.6.3.5 Setting breakpoints 


To figure out what's wrong, you can set a breakpoint in your code. A breakpoint is a spot 
in your program where you tell the debugger to stop execution so you can poke around in 
the code. The Perl debugger lets you set breakpoints in various ways. They let you run 
the program, stopping only to examine it when a statement with a breakpoint is reached. 
That way, you don't have to step through every line of code. (If you have 5,000 lines of 
code, and the error happens when you hit a line of code that's first used when you're 
reading the 12,000th line of input, you'll be happy about this feature.) 


Notice that the part of this loop that prints out the rest of the string, once the starting two 
bases have been found, is the if block starting at line 40: 

if ($receivingcommittment) {
    print;
    next;
}

Let's look at that $receivingcommittment variable.


Here's one way to do this. Let's set a breakpoint at line 40. Type b 40 and then c to 
continue, and the program proceeds until it hits line 40: 


DB<11> b 40
DB<12> c
main::(example6-4:40):      if ($receivingcommittment) {
DB<12> p
C
DB<12>


The last command, p, prints out the element from the @dna array you reached in the
foreach loop. Since you didn't specify a variable for the loop, it used the default $_
variable. Many Perl commands such as print or pattern matching operate on the default
$_ variable if no other variable is given. (It's the cousin of the @_ default array that
subroutines use to hold their parameters.) So the p debugger command shows that
you're operating on C from the @dna array, which is the first character.
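
If the default variable is new to you, a small sketch may help. It assumes a four-base string invented for the purpose, and the $seen variable exists only so the result can be checked afterwards:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @dna  = split('', 'CGTA');
my $seen = '';

foreach (@dna) {     # no loop variable given, so each element lands in $_
    next if /G/;     # a bare pattern match tests $_
    $seen .= $_;     # $_ can also be named explicitly
    print;           # print with no argument prints $_
}
print "\n";
```

This prints CTA: the G is skipped by the pattern match, and every other base is printed without $_ ever being mentioned in the print statement.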


All well and good. But it would be good to have the program break when the variable
$receivingcommittment has a change in its value, and then single step from there,
to see why the program isn't printing out the rest of the string. Recall that this variable is
the flag whose change tells the program to print the rest of the string. First let's delete all
other breakpoints:


DB<12> D
Deleting all breakpoints...

You can "watch" the variable with W like so:

DB<12> W $receivingcommittment
DB<13> c
TA
Debugged program terminated. Use q to quit or R to restart,
use O inhibit_exit to avoid stopping after program termination,
h q, h R or h O to get additional info.
DB<13>

Wait a minute! The W command should indicate when $receivingcommittment
changes value. But when the program continued running with the c command, it ran to
the end, meaning that $receivingcommittment never changed value. So let's start
up the program again and break on the line that changes its value:

DB<13> R
Warning: some settings and command-line options may be lost!
Default die handler restored.

Loading DB routines from perl5db.pl version 1.07
Editor support available.

Enter h or 'h h' for help, or 'man perldebug' for more help.

main::(example6-4:11):    my $dna = 'CGACGTCTTCTAAGGCGA';
DB<13> w 45
42:         next;
43        } elsif ($previousbase eq $base1) {
44:         if ( /$base2/ ) {
45:           print $base1, $base2;
46:           $recievingcommitment = 1;
47          }
48        }
49:       $previousbase = $_;
50      }
51
DB<14> b 46
DB<15> c
TAmain::(example6-4:46):          $recievingcommitment = 1;
DB<15> n
main::(example6-4:49):      $previousbase = $_;
DB<15> p $receivingcommittment

DB<16>








Huh? The code says it's assigning the variable a value of 1, but after you execute that
line with n and try to print out the value, it doesn't print anything.
If you stare harder at the program, you see that at line 46 you misspelled
$receivingcommittment as $recievingcommitment. That explains
everything; fix it and run it again:


$ perl example6-4
TAAGGCGA


Success! 
6.6.3.6 Fixing another bug 


Now, did that fix the other bug when you ran Example 6-4 with an argument? 

$ perl example6-4 AA
GACGTCTTCTAAGGCGA

Again, huh? You expected AAGGCGA. Can there be another bug in the program? Let's try 
the debugger again: 

$ perl -d example6-4 AA
Default die handler restored.

Loading DB routines from perl5db.pl version 1.07
Editor support available.

Enter h or 'h h' for help, or 'man perldebug' for more help.

main::(example6-4:11):    my $dna = 'CGACGTCTTCTAAGGCGA';
DB<1> n
main::(example6-4:12):    my @dna;
DB<1> n
main::(example6-4:13):    my $receivingcommittment;
DB<1> n
main::(example6-4:14):    my $previousbase = '';
DB<1> n
main::(example6-4:16):    my $subsequence = '';
DB<1> n
main::(example6-4:18):    if (@ARGV) {
DB<1> n
main::(example6-4:19):        my $subsequence = $ARGV[0];
DB<1> n
main::(example6-4:24):    my $base1 = substr($subsequence, 0, 1);
DB<1> n
main::(example6-4:25):    my $base2 = substr($subsequence, 1, 1);
DB<1> n
main::(example6-4:28):    @dna = split ( '', $dna );
DB<1> p $subsequence

DB<2> p $base1

DB<3> p $base2

DB<4>


Okay, for some reason the $subsequence, and therefore the $base1 and $base2
variables, are not getting set right. How come?


Check out line 19 where you declared a new my variable in the block of the if statement
with the same name, $subsequence. That's the variable you're setting, but it's
disappearing as soon as the if statement is over, because it's scoped in the block since
it's a my variable.
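
To see the scoping problem in isolation, here is a minimal sketch. It is not Example 6-4 itself: the @args array is a stand-in for @ARGV, and $buggy and $fixed are names chosen for this illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @args = ('AA');            # stand-in for @ARGV (an assumption of this sketch)

# Buggy version: the inner "my" creates a second, block-local variable
my $buggy = '';
if (@args) {
    my $buggy = $args[0];     # a new variable that vanishes at the closing brace
}

# Fixed version: plain assignment to the variable declared outside the block
my $fixed = '';
if (@args) {
    $fixed = $args[0];
}

print "buggy: '$buggy'\n";    # prints buggy: ''
print "fixed: '$fixed'\n";    # prints fixed: 'AA'
```

Perl raises no warning for the buggy version, because a my declaration in an inner block that hides an outer variable is legal.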


So again, you fix that problem by removing the my declaration on line 19
and instead inserting an assignment $subsequence = $ARGV[0]; and run the
program again:


$ perl example6-4
TAAGGCGA
$ perl example6-4 AA
AAGGCGA


Here, finally, is success. 
6.6.3.7 use warnings; and use strict; redux 


Example 6-4 was somewhat artificial. It turns out that these problems would have 
been reported easily if warnings had been used. So let's see an actual example of the 
benefits of use strict; and use warnings;, as discussed earlier in this chapter. 


If you go back to the original Example 6-4 and add the use warnings; directive 
near the top of the program, you get the following output: 


$ perl example6-4
Name "main::recievingcommitment" used only once: possible typo at example6-4 line 47.
TA


As you see, the warnings found the first bug immediately. They noticed there was a
variable that was used only once, usually a sign of a misspelled variable. (I can never
spell "receiving" or "commitment" properly.) So fix the misspelling that the warning
reports at line 47, and run it again:


$ perl example6-4
TAAGGCGA
$ perl example6-4 AA
substr outside of string at example6-4 line 26.
Use of uninitialized value in regexp compilation at example6-4 line 45.
Use of uninitialized value in print at example6-4 line 46.
GACGTCTTCTAAGGCGA








So, the first bug is fixed. The second bug remains with a few warnings that are, perhaps, 
hard to understand. But focus on the first error message, and see that it complains about 
line 26: 


my $base2 = substr($subsequence, 1, 1);


So, there's something wrong with $subsequence. Often, error messages will be off by
one line, so it may well be that the error starts on the line before, the first time
$subsequence is operated on by the substr. But that's not the case here.


Nonetheless, the warnings have pointed directly to the problem. In this case, you still
have to take a little initiative; look back at the $subsequence variable and notice the
extra my declaration within the if block on line 20 that is preventing the variable from
being initialized properly. Declaring a variable scoped within a block that overrides
another variable of the same name outside the block is not necessarily a bug. In fact,
it's perfectly legal, so the programmers who wrote the warnings did not flag it as an
obvious error. However, it seems to have caused a real problem here!


One final point: if you go back to the original, buggy program, notice 
there's no use strict; in the program. If you add that and run the program without 
arguments, you get the following: 


$ perl example6-4
Global symbol "$recievingcommitment" requires explicit package name at example6-4 line 47.
Execution of example6-4 aborted due to compilation errors.











Fixing the misspelled variable, and running the program with the argument, you get: 


$ perl example6-4 AA
GACGTCTTCTAAGGCGA

You can see that use strict; didn't help for the other bug. Remember, it's best to 
employ both use strict; and use warnings;. 
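
You can reproduce the use strict; catch on a much smaller scale. The following is only a sketch: it compiles a three-line fragment (invented for this illustration) with a string eval, so the compilation error ends up in $@ instead of aborting the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compile a tiny program fragment at run time so that the compile error
# can be captured in $@ instead of stopping this script.
my $fragment = q{
    use strict;
    my $receivingcommittment;
    $recievingcommitment = 1;    # misspelled on purpose
    1;
};

my $ok    = eval $fragment;
my $error = $@;

print "compiled: ", ($ok ? "yes" : "no"), "\n";
print "message:  $error";
```

The fragment fails to compile, and the captured message begins with the same "Global symbol" complaint shown above. The exact wording after that varies between Perl versions.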





6.7 Exercises 


Exercise 6.1 
Write a subroutine to concatenate two strings of DNA. 


Exercise 6.2
Write a subroutine to report the percentage of each nucleotide in DNA. You've
seen the plus operator +. You will also want to use the divide operator / and the
multiply operator *. Count the number of each nucleotide, divide by the total
length of the DNA, then multiply by 100 to get the percentage. Your arguments
should be the DNA and the nucleotide you want to report on. The int function
can be used to discard digits after the decimal point, if needed.


Exercise 6.3 


Write a subroutine to prompt a user with any message, and collect the user's 
answer. The subroutine's argument should be the message, and the return value 
should be the (one-line) answer. 


Exercise 6.4 


Write a subroutine to look for command-line arguments such as -help, -h, and
--help. Recall that command-line arguments appear in the @ARGV array. Call
your subroutine from a main program. If you give the program any of the named
command-line arguments, when you pass them into the subroutine it should return
a true value. If this is the case, have the program print out a help message in a
$USAGE variable and exit.


Exercise 6.5 


Write a subroutine to check if a file exists, is a regular file, and is nonzero in size. 
Use the file test operators (See Appendix B). 


Exercise 6.6 


Use Exercise 6.3 in a subroutine that keeps prompting until a valid file is entered 
by the user or until five attempts have failed. 


Exercise 6.7 


Write a module that contains subroutines that report various statistics on DNA
sequences, for instance length, GC content, presence or absence of poly-T
sequences (long stretches of mostly T's at the 5' (left) end of many $DNA
sequences), or other measures of interest.


Exercise 6.8 


Write a subroutine to do something a biologist normally does. (Here's an 
opportunity to look around the lab and write a useful program!) 


Exercise 6.9 


Read the documentation about the debugger and become familiar with its use by 
applying it during your programming. 


Exercise 6.10 


Write a subroutine that alters an array of lines in a file. Use pass by reference for
the array. Pass the subroutine a reference to the array, a regular expression, and a
string to replace the regular expression. All the lines of the array should be altered
by substituting the matches found for the regular expression by the replacement
string.


Chapter 7. Mutations and Randomization


As every biologist knows, mutation is a fundamental topic in biology. Mutations in DNA 
occur all the time in cells. Most of them don't affect the actions of proteins and are benign. 
Some of them do affect the proteins and may result in diseases such as cancer. Mutations 
can also lead to nonviable offspring that dies during development; occasionally they can 
lead to evolutionary change. Many cells have very complex mechanisms to repair 
mutations. 


Mutations in DNA can arise from radiation, chemical agents, replication errors, and other 
causes. We're going to model mutations as random events, using Perl's random number 
generator. 


Randomization is a computer technique that crops up regularly in everyday programs, 
most commonly in cryptography, such as when you want to generate a hard-to-guess 
password. But it's also an important branch of algorithms: many of the fastest algorithms 
employ randomization. 


Using randomization, it's possible to simulate and investigate the mechanisms of 
mutations in DNA and their effect upon the biological activity of their associated proteins. 
Simulation is a powerful tool for studying systems and predicting what they will do; 
randomization allows you to better simulate the "ordered chaos" of a biological system. 
The ability to simulate mutations with computer programs can aid in the study of 
evolution, disease, and basic cellular processes such as division and DNA repair 
mechanisms. Computer models of cell development and function, now in their early 
stages, will become much more accurate and useful in coming years, and mutation is a 
basic biological mechanism these models will incorporate. 


From the standpoint of programming technique, as well as from the standpoint of 
modeling evolution, mutation, and disease, randomization is a powerful—and, luckily for 
us, easy-to-use—programming skill. 


Here's a breakdown of what we will accomplish in this chapter: 


Randomly select an index into an array and a position in a string: these are the basic tools 
for picking random locations in DNA (or other data) 


Model mutation with random numbers by learning how to randomly select a nucleotide in 
DNA and then mutate it to some other (random) nucleotide 


Use random numbers to generate DNA sequence data sets, which can be used to study the 
extent of randomness in actual genomes 


Repeatedly mutate DNA to study the effect of mutations accumulating over time during
evolution


7.1 Random Number Generators


A random number generator is a subroutine you can call. For most practical purposes,
you needn't worry about what's inside it. The values you get for random numbers on the
computer differ somewhat from the values of real-world random events as measured, for
example, by detecting nuclear decay events. Some computers actually have devices such
as Geiger counters attached so as to have a source of truly random events. But I'd be
willing to bet your computer doesn't. What you have in place of a Geiger counter is an
algorithm called a random number generator.


The numbers that are output by random number generators are not really random; they 
are thus called pseudo-random numbers. A random number generator, being an algorithm, 
is predictable. A random number generator needs a seed, an input you can change to get a 
different series of (pseudo-)random numbers. 


The numbers from a random number generator give an even distribution of values. This 
is one of the most important characteristics of randomness and largely justifies the use of 
these algorithms where some amount of random behavior is desired. 
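
One way to convince yourself of the even distribution is to tally a large number of random picks. The sketch below assumes four equally likely outcomes and an arbitrary sample size of 40,000; the variable names are invented for this illustration, and the exact counts will differ on every run.

```perl
#!/usr/bin/perl
use strict;
use warnings;

srand(time|$$);                # seeded the same way as later in this chapter

my @nucleotides = ('A', 'C', 'G', 'T');
my $draws = 40000;             # arbitrary sample size chosen for this sketch
my %tally = map { $_ => 0 } @nucleotides;

# Pick a random element many times and count how often each one comes up
for (my $i = 0; $i < $draws; ++$i) {
    ++$tally{ $nucleotides[rand @nucleotides] };
}

foreach my $base (@nucleotides) {
    printf "%s: %5d (%.1f%%)\n", $base, $tally{$base}, 100 * $tally{$base} / $draws;
}
```

Each nucleotide should come up close to 25 percent of the time, with small deviations that shrink as the number of draws grows.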


The other "take-home message" about random number generators is that the seed you
start them up with should itself be selected randomly. If you seed with the same number
every time, you'll get the same sequence of "random numbers" every time as well. (Not
very random!) Try to pick a seed that has some randomness in it, such as a number
calculated from some computer event that changes haphazardly over time.[1]


[1] Even here, for critical applications, you're not out of the woods. Unless you pick your seeds
carefully, hackers will figure out how you're picking them and crack your random numbers and
therefore your passwords. The method used to generate seeds in this chapter, time|$$, is
crackable by dedicated hackers. A better choice is time() ^ ($$ + ($$ << 15)). If program security
is important, you should consult the Perl documentation, and the Math::Random and
Math::TrulyRandom modules from CPAN.


In the examples that follow, I use a simple method for seed picking that's good enough
for most purposes. If you use random numbers for data encryption with critical privacy
issues (such as patient records), you should read further into the Perl documentation
about the several advanced options Perl provides for random number generation.


7.2 A Program Using Randomization 


Example 7-1 introduces randomization in the context of a simple program. It 
randomly combines parts of sentences to construct a story. This isn't a bioinformatics 
program, but I've found that it's an effective way to learn the basics of randomization. 
You will learn how to randomly select elements from arrays, which you'll apply in the 
future examples that mutate DNA. 


The example declares a few arrays filled with parts of sentences, then randomizes their
assembly into complete sentences. It's a trivial children's game; yet it teaches several
programming points.


Example 7-1. Children's game with random numbers 


#!/usr/bin/perl
# Children's game, demonstrating primitive artificial intelligence,
# using a random number generator to randomly select parts of sentences.

use strict;
use warnings;

# Declare the variables
my $count;
my $input;
my $number;
my $sentence;
my $story;

# Here are the arrays of parts of sentences:
my @nouns = (
    'Dad',
    'TV',
    'Mom',
    'Groucho',
    'Rebecca',
    'Harpo',
    'Robin Hood',
    'Joe and Moe',
);

my @verbs = (
    'ran to',
    'giggled with',
    'put hot sauce into the orange juice of',
    'exploded',
    'dissolved',
    'sang stupid songs with',
    'jumped with',
);

my @prepositions = (
    'at the store',
    'over the rainbow',
    'just for the fun of it',
    'at the beach',
    'before dinner',
    'in New York City',
    'in a dream',
    'around the world',
);

# Seed the random number generator.
# time|$$ combines the current time with the current process id
# in a somewhat weak attempt to come up with a random seed.
srand(time|$$);

# This do-until loop composes six-sentence "stories".
# until the user types "quit".
do {
    # (Re)set $story to the empty string each time through the loop
    $story = '';

    # Make 6 sentences per story.
    for ($count = 0; $count < 6; $count++) {

        # Notes on the following statements:
        # 1) scalar @array gives the number of elements in the array.
        # 2) rand returns a random number from 0 up to (but not
        #    including) scalar(@array).
        # 3) int removes the fractional part of a number.
        # 4) . joins two strings together.
        $sentence = $nouns[int(rand(scalar @nouns))]
                  . " "
                  . $verbs[int(rand(scalar @verbs))]
                  . " "
                  . $nouns[int(rand(scalar @nouns))]
                  . " "
                  . $prepositions[int(rand(scalar @prepositions))]
                  . '. ';

        $story .= $sentence;
    }

    # Print the story.
    print "\n",$story,"\n";

    # Get user input.
    print "\nType \"quit\" to quit, or press Enter to continue: ";

    $input = <STDIN>;

# Exit loop at user's request
} until($input =~ /^\s*q/i);

exit;


Here is some typical output from Example 7-1: 

Joe and Moe jumped with Rebecca in New York City. Rebecca exploded Groucho
in a dream. Mom ran to Harpo over the rainbow. TV giggled with Joe and Moe
over the rainbow. Harpo exploded Joe and Moe at the beach. Robin Hood giggled
with Harpo at the beach.

Type "quit" to quit, or press Enter to continue:

Harpo put hot sauce into the orange juice of TV before dinner. Dad ran to
Groucho in a dream. Joe and Moe put hot sauce into the orange juice of TV
in New York City. Joe and Moe giggled with Joe and Moe over the rainbow. TV
put hot sauce into the orange juice of Mom just for the fun of it. Robin Hood
ran to Robin Hood at the beach.

Type "quit" to quit, or press Enter to continue: quit


The structure of the example is quite simple. After enforcing the declarations of variables, 
and turning on warnings, with: 


use strict; 
use warnings; 


the variables are declared, and the arrays are initialized with values. 
7.2.1 Seeding the Random Number Generator 


Next, the random number generator is seeded by a call to the built-in function srand. It 
takes one argument, the seed for the random number generator discussed earlier. As 
mentioned, you have to give a different seed at this step to get a different series of 
random numbers. Try changing this statement to something like: 

srand(100);

and then run the program more than once. You'll get the same results each time.[2] The
seed you're using:

time|$$

is a calculation that returns a different seed each time.


[2] The latest random number generators automatically change the series, so if this experiment
doesn't work, you're probably using a very new random number generator. However,
sometimes you want to repeat a series. Note that newer versions of Perl automatically give
you a good seed if you call srand like so: srand;.


time returns a number representing the time, $$ returns a number representing the ID of 
the Perl program that's running (this typically changes each time you run the program), 
and | means bitwise OR and combines the bits of the two numbers (for details see the Perl 
documentation). There are other ways to pick a seed, but let's stick with this popular one. 
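
The effect of the seed is easy to check for yourself. In this sketch, random_series is a helper name made up for the illustration; it seeds the generator and then draws a few integers below 100. The actual numbers depend on your version of Perl, so none are shown here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Draw $n random integers below 100 after seeding with $seed.
# (random_series is a name invented for this sketch.)
sub random_series {
    my ($seed, $n) = @_;
    srand($seed);
    return map { int(rand(100)) } 1 .. $n;
}

my @first  = random_series(100, 5);
my @second = random_series(100, 5);        # same seed again
my @varied = random_series(time|$$, 5);    # the seed used in this chapter

print "seed 100: @first\n";
print "seed 100: @second\n";
print "time|\$\$:  @varied\n";
```

The first two lines of output are identical, because the same seed always starts the same series. The third line normally changes from run to run.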


7.2.2 Control Flow 


The main loop of the program is a do-until loop. These loops are handy when you 
want to do something (like print a little story) before taking any actions (like asking the 
user if he wants to continue) each time through the loop. The do-until loop first 
executes the statements in the block and then performs a test to determine if it should 
repeat the statements in the block. Note that this is the reverse of the other types of loops 
you've seen that do the test first and then the block. 
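
The difference in the order of "test" and "block" can be seen in a small sketch. It assumes canned answers in an array in place of real keyboard input, and the counters are invented for the illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @answers = ("\n", "yes\n", "quit\n");   # canned input standing in for <STDIN>

# do-until: the block runs first, then the test decides whether to repeat
my $stories = 0;
my $input;
do {
    ++$stories;                  # the "work" happens before any test
    $input = shift @answers;
} until ($input =~ /^\s*q/i);

# while: the test comes first, so the block may never run at all
my $runs = 0;
while ($runs < 0) {
    ++$runs;
}

print "do-until ran $stories times; while ran $runs times\n";
```

This prints "do-until ran 3 times; while ran 0 times": the do-until block runs once per answer until it sees one starting with q, while the while block is never entered because its test fails at the outset.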


Since the $story variable is always being appended to, it needs to be emptied at the top 
of each loop. It's common to forget that variables that are increased in some way need to 
be reset at the correct spot, so watch for that in your programming. The clue is 
increasingly long strings or big numbers. 


The for loop contains the main work of the program. As you've seen before, this loop
initializes a counter, performs a test, and then increments the counter at the end of the
block.


7.2.3 Making a Sentence 


In Example 7-1, note that the statement that makes a sentence stretches out over a few 
lines of code. It's a bit complicated, and it's the real work of the whole program, so there 
are comments attached to help read it. Notice that the statement has been carefully 
formatted so that it's neatly laid out over its eight lines. The variable names have been 
well chosen, so it's clear that you're making a sentence out of a noun, a verb, a noun, and 
a prepositional phrase. 


However, even with all that, there are rather deeply nested expressions within the square
brackets that specify the array positions, and it requires a bit of scrutiny to read this code.
You will see that you're building a string out of sentence parts separated by spaces and
ending with a period and a space. The string is built by several applications of the dot
string concatenation operator. These have been placed at the beginning of each line to
clarify the overall structure of the statement.


7.2.4 Randomly Selecting an Element of an Array 


Let's look closely at one of the sentence part selectors:

$verbs[int(rand(scalar @verbs))]

These kinds of nested braces need to be read and evaluated from the inside out. So the
expression that's most deeply surrounded by braces is:

scalar @verbs

You see from the comments before the statement that the built-in function scalar returns
the number of elements in an array. The array in question, @verbs, has seven elements,
so this expression returns 7.


So now you have:

$verbs[int(rand(7))]

and the most deeply nested expression is now:

rand(7)

The helpful comments in the code before the statement remind you that this expression
returns a (pseudo)random number that is at least 0 and less than 7. This number is a
floating-point number (decimal number with a fraction). Recall that an array with seven
elements will number them from 0 to 6.


So now you have something like this:

$verbs[int(3.47429)]

and you want to evaluate the expression:

int(3.47429)

The int function discards the fractional part of a floating-point number and returns just
the integer part, in this case 3.


So you've come to the final step:

$verbs[3]

which gives you the fourth element of the @verbs array, as the comments have been
kind enough to remind you.


7.2.5 Formatting


To randomly select a verb, you call a few functions:


scalar
    Determines the size of the array
rand
    Picks a random number in the range determined by the size of the array
int
    Transforms the floating-point number rand returns into the integer value you
    need for an array element


Several of these function calls are combined in one line using nested braces. Sometimes 
this produces hard-to-read code, and the gentle reader may be nodding his or her head 
vigorously at this unflattering characterization of the author's painstaking handiwork. 
You could try rewriting these lines, using additional temporary variables. For instance, 
you can say: 


$verb_array_size = scalar @verbs;
$random_floating_point = rand ( $verb_array_size );
$random_integer = int $random_floating_point;
$verb = $verbs[$random_integer];


and repeat for the other parts of speech, finally building your sentence with a statement
such as:


$sentence = "$subject $verb $object $prepositional_phrase. ";



It's a matter of style. You will make these kinds of choices all the time as you program. 
The choice of layout in Example 7-1 was based on a tradeoff between a desire to 
express the overall task clearly (which won) balanced against the difficulty of reading 
highly nested function calls (which lost). Another reason for this layout choice is that, in 
the programs that follow, you'll select random elements in arrays with some regularity, so 
you'll get used to seeing this particular nesting of calls. In fact, perhaps you should make 
a little subroutine out of this kind of call if you will do the same thing many times? 
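
Here is one way such a subroutine might look. This is only a sketch: randomelement is a name invented here, not part of Example 7-1, and the arrays are shortened versions of the ones in the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# randomelement (a name invented for this sketch)
# Return one randomly chosen element of the array passed in.
# As elsewhere in this chapter, call srand once before using it.
sub randomelement {
    my (@array) = @_;
    return $array[int(rand(scalar @array))];
}

my @nouns        = ('Dad', 'TV', 'Mom');
my @verbs        = ('ran to', 'giggled with', 'exploded');
my @prepositions = ('at the store', 'over the rainbow');

srand(time|$$);

my $sentence = randomelement(@nouns)
             . " "
             . randomelement(@verbs)
             . " "
             . randomelement(@nouns)
             . " "
             . randomelement(@prepositions)
             . '. ';

print $sentence, "\n";
```

The nested calls now live in one place, and the statement that builds the sentence reads almost like the sentence itself.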


Readability is the most important thing here, as it is in most code. You have to be able to 
read and understand code, your own as well as the code of others, and that is usually 
more important than trying to achieve other laudable goals such as fastest speed, smallest 
amount of memory used, or shortest program. It's not always important, but usually it's 
best to write for readability first, then go back and try to goose up the speed (or whatever) 
if necessary. You can even leave the more readable code in there as comments, so 
whoever has to read the code can still get a clear idea of the program and how you went 
about improving the speed (or whatever). 


7.2.6 Another Way to Calculate the Random Position


Perl often has several ways to accomplish a task. The following is an alternate way to
write this random number selection; it uses the same function calls but without the
parentheses:


$verbs[int rand scalar @verbs]

This chaining of functions, each of which takes one argument, is common in Perl. To
evaluate the expression, Perl first takes @verbs as an argument to scalar, which returns
the size of the array. Then it takes that value as an argument to rand, which returns a
floating-point number from 0 to less than the size of the array. It then uses that
floating-point number as an argument to int, which returns the greatest integer less than
or equal to the floating-point number. In other words, it calculates the same number to be
used as the subscript for the array @verbs.


Why does Perl allow this? Because such calculations are very frequent, and, in the spirit 
of "Let the computer do the work," Perl designer Larry Wall decided to save you (and 
himself) the bother of typing and matching all those parentheses. 


Having gone that far, Larry decided it'd be easy to add even more. You can eliminate the 
scalar and the int function calls and use: 

$verbs[rand @verbs]

What's going on here? Since rand already expects a scalar value, it evaluates @verbs in 
a scalar context, which simply returns the size of the array. Larry cleverly designed array 
subscripts (which, of course, are always integer values) to automatically take just the 
integer part of a floating-point value if it was given as a subscript; so, out with the int. 
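
If you'd like to check that the three spellings really select the same element, the following sketch re-seeds the generator with the same (arbitrary) number before each one, so that rand returns the same value each time. The shortened @verbs array and the variable names are assumptions of this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @verbs = ('ran to', 'giggled with', 'exploded', 'dissolved');

# Re-seeding with the same number before each form makes rand return the
# same value each time, so the three subscripts can be compared directly.
srand(42);
my $with_parens = $verbs[int(rand(scalar @verbs))];

srand(42);
my $chained = $verbs[int rand scalar @verbs];

srand(42);
my $shortest = $verbs[rand @verbs];

print "$with_parens | $chained | $shortest\n";
```

All three variables end up holding the same verb. Which verb that is depends on your Perl's random number generator.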


7.3 A Program to Simulate DNA Mutation 


Example 7-1 gave you the tools you'll need to mutate DNA. In the following examples, 
you'll represent DNA, as usual, by a string made out of the alphabet A, C, G, and T. 
You'll randomly select positions in the string and then use the substr function to alter 
the DNA. 


This time, let's go about things a little differently and first compose some of the useful 
subroutines you'll need before showing the whole program. 


7.3.1 Pseudocode Design 


Starting with simple pseudocode, here's a design for a subroutine that mutates a random 
position in DNA to a random nucleotide: 


Select a random position in the string of DNA.
Choose a random nucleotide.
Substitute the random nucleotide into the random position in the DNA.


This seems short and to the point. So you decide to make each of the first two sentences
into a subroutine.


7.3.1.1 Select a random position in a string 


How can you randomly select a position in a string? Recall that the built-in function 
length returns the length of a string. Also recall that positions in strings are numbered 
from 0 to length-1, just like positions in arrays. So you can use the same general idea 
as in Example 7-1, and make a subroutine: 


# randomposition
#
# A subroutine to randomly select a position in a string.
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomposition {

    my($string) = @_;

    # This expression returns a random number between 0 and length-1,
    #  which is how the positions in a string are numbered in Perl.

    return int(rand(length($string)));
}

randomposition is really a short function, if you don't count the comments. It's just 
like the idea in Example 7-1 to select a random array element. 


Of course, if you were really writing this code, you'd make a little test to see if your 
subroutine worked: 


#!/usr/bin/perl -w
# Test the randomposition subroutine

my $dna = 'AACCGTTAATGGGCATCGATGCTATGCGAGCT';
srand(time|$$);
for (my $i=0 ; $i < 20 ; ++$i ) {
    print randomposition($dna), " ";
}
print "\n";

exit;

sub randomposition {
    my($string) = @_;
    return int rand length $string;
}


Here's some representative output of the test (your results should vary): 


26 26 20 1 29 7 2 27 2 24 6 4 23 7 13 14 2 12 13 27

Notice the new look of the for loop: 

for (my $i=0 ; $i < 20 ; ++$i ) {

This shows how you can localize the counter variables (in this case, $i) to the loop by 
declaring them with my inside the for loop. 
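
As a small illustration of that localization (a sketch, with an outer variable that is deliberately given the same name), a counter declared with my inside the for loop does not disturb a variable called $i in the enclosing scope:

```perl
use strict;
use warnings;

my $i = 'outer';            # a separate variable in the enclosing scope
my $iterations = 0;

for (my $i = 0 ; $i < 3 ; ++$i) {
    ++$iterations;          # this $i exists only inside the loop
}

print "$i $iterations\n";   # prints "outer 3": the outer $i is untouched
```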





7.3.1.2 Choose a random nucleotide 


Next, let's write a subroutine that randomly chooses one of the four nucleotides: 


# randomnucleotide
#
# A subroutine to randomly select a nucleotide
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomnucleotide {

    my(@nucs) = @_;

    # scalar returns the size of an array.
    # The elements of the array are numbered 0 to size-1
    return $nucs[rand @nucs];
}

Again, this subroutine is short and sweet. (Most useful subroutines are; although writing 
a short subroutine is no guarantee it will be useful. In fact, you'll see in a bit how you can 
improve this one.) 


Let's test this one too: 


#!/usr/bin/perl -w
# Test the randomnucleotide subroutine

my @nucleotides = ('A', 'C', 'G', 'T');
srand(time|$$);

for (my $i=0 ; $i < 20 ; ++$i ) {
    print randomnucleotide(@nucleotides), " ";
}
print "\n";
exit;

sub randomnucleotide {
    my(@nucs) = @_;

    return $nucs[rand @nucs];
}


Here's some typical output (it's random, of course, so there's a high probability your 
output will differ): 


CAAAATTTTTTACACTAAGGEGEG 
7.3.1.3 Place a random nucleotide into a random position 


Now for the third and final subroutine, that actually does the mutation. Here's the code: 


# mutate
#
# A subroutine to perform a mutation in a string of DNA
#

sub mutate {

    my($dna) = @_;
    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # Pick a random position in the DNA
    my($position) = randomposition($dna);

    # Pick a random nucleotide
    my($newbase) = randomnucleotide(@nucleotides);

    # Insert the random nucleotide into the random position in the DNA.
    # The substr arguments mean the following:
    #  In the string $dna at position $position change 1 character to
    #  the string in $newbase
    substr($dna,$position,1,$newbase);

    return $dna;
}

Here, again, is a short program. As you look it over, notice that it's relatively easy to read
and understand. You mutate by picking a random position then selecting a nucleotide at 
random and substituting that nucleotide at that position in the string. (If you've forgotten 
how substr works, refer to Appendix B or other Perl documentation. If you're like me, 
you probably have to do that a lot, especially to get the order of the arguments right.) 
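
As a quick reminder of the argument order, here is a minimal sketch of the four-argument form of substr used in mutate. The sample string and the offset are arbitrary choices for illustration:

```perl
use strict;
use warnings;

my $dna = 'AAAAAAAAAA';

# Four-argument substr: (string, offset, length, replacement).
# Replace 1 character at offset 3 with 'G'. The string is changed in
# place, and the return value is the text that was replaced.
my $old = substr($dna, 3, 1, 'G');

print "$old $dna\n";
```

After the call, $dna holds AAAGAAAAAA and $old holds the A that was overwritten.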


There's a slightly different style used here for declaring variables. Whereas you've been 
declaring them at the beginning of a program, here you're declaring each variable the first 
time it's used. There are pros and cons for each programming style. Having all the 
variables at the top of the program gives good organization and can help in reading; 
declaring them on-the-fly can seem like a more natural way to write. The choice is yours. 


Also, notice how this subroutine is mostly built from other subroutines, with a little bit 
added. That has a lot to do with its readability. At this point, you may be thinking that 
you've actually decomposed the problem pretty well, and the pieces are fairly easy to 
build and, in the end, they fit together well. But do they? 


7.3.2 Improving the Design 


You're about to pat yourself on the back for writing the program so quickly, but you 
notice something. You keep having to declare that pesky @nucleotides array and 
then pass it in to the randomnucleotide subroutine. But the only place you use the 
array is inside the randomnucleotide subroutine. So why not change your design a 
little? Here's a new try: 

# randomnucleotide
#
# A subroutine to randomly select a nucleotide
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomnucleotide {
    my(@nucs) = ('A', 'C', 'G', 'T');

    # scalar returns the size of an array.
    # The elements of the array are numbered 0 to size-1
    return $nucs[rand @nucs];
}

Notice that this function now has no arguments. It's called like so:

$randomnucleotide = randomnucleotide( );


It's asking for a random element from a very specific set. Of course, you're always 
thinking, and you say, "It'd be handy to have a subroutine that randomly selects an 
element from any array. I might not need it right now, but I bet I'll need it soon!" So you 
define two subroutines instead of one: 
# randomnucleotide
#
# A subroutine to randomly select a nucleotide
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomnucleotide {
    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # scalar returns the size of an array.
    # The elements of the array are numbered 0 to size-1
    return randomelement(@nucleotides);
}

# randomelement
#
# A subroutine to randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomelement {

    my(@array) = @_;

    return $array[rand @array];
}

Look back and notice that you didn't have to change your subroutine mutate; just the 
internal workings of randomnucleotide changed, not its behavior. 


7.3.3 Combining the Subroutines to Simulate Mutation 


Now you've got all your ducks in place, so you write your main program as in Example 
7-2 and see if your new subroutine works. 


Example 7-2. Mutate DNA 

#!/usr/bin/perl
# Mutate DNA
#  using a random number generator to randomly select bases to mutate

use strict;
use warnings;

# Declare the variables

# The DNA is chosen to make it easy to see mutations:
my $DNA = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA';

# $i is a common name for a counter variable, short for "integer"
my $i;

my $mutant;

# Seed the random number generator.
# time|$$ combines the current time with the current process id
srand(time|$$);

# Let's test it, shall we?
$mutant = mutate($DNA);

print "\nMutate DNA\n\n";

print "\nHere is the original DNA:\n\n";
print "$DNA\n";

print "\nHere is the mutant DNA:\n\n";
print "$mutant\n";

# Let's put it in a loop and watch that bad boy accumulate mutations:
print "\nHere are 10 more successive mutations:\n\n";

for ($i=0 ; $i < 10 ; ++$i) {
    $mutant = mutate($mutant);
    print "$mutant\n";
}

exit;

################################################################################
# Subroutines for Example 7-2
################################################################################

# Notice, now that we have a fair number of subroutines, we
# list them alphabetically

# A subroutine to perform a mutation in a string of DNA
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub mutate {

    my($dna) = @_;
    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # Pick a random position in the DNA
    my($position) = randomposition($dna);

    # Pick a random nucleotide
    my($newbase) = randomnucleotide(@nucleotides);

    # Insert the random nucleotide into the random position in the DNA
    # The substr arguments mean the following:
    #  In the string $dna at position $position change 1 character to
    #  the string in $newbase
    substr($dna,$position,1,$newbase);

    return $dna;
}


# A subroutine to randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomelement {

    my(@array) = @_;

    return $array[rand @array];
}

# randomnucleotide
#
# A subroutine to select at random one of the four nucleotides
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomnucleotide {

    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # scalar returns the size of an array.
    # The elements of the array are numbered 0 to size-1
    return randomelement(@nucleotides);
}

# randomposition
#
# A subroutine to randomly select a position in a string.
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomposition {

    my($string) = @_;

    # Notice the "nested" arguments:
    #
    # $string is the argument to length
    # length($string) is the argument to rand
    # rand(length($string)) is the argument to int
    # int(rand(length($string))) is the argument to return
    # But we write it without parentheses, as permitted.
    #
    # rand returns a decimal number between 0 and its argument.
    # int returns the integer portion of a decimal number.
    #
    # The whole expression returns a random number between 0 and length-1,
    #  which is how the positions in a string are numbered in Perl.
    #

    return int rand length $string;
}

Here's some typical output from Example 7-2: 
Mutate DNA

Here is the original DNA:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Here is the mutant DNA:

AAAAAAAAAAAAAAAAAAAAGAAAAAAAAA

Here are 10 more successive mutations:

AAAAAAAAAAAAAAAAAAAAGACAAAAAAA
AAAAAAAAAAAAAAAAAAAAGACAAAAAAA
AAAAAAAAAAAAAAAAAAAAGACAAAAAAA
AAAAAAAAAAAAAACAAAAAGACAAAAAAA
AAAAAAAAAAAAAACAACAAGACAAAAAAA
AAAAAAAAAAAAAACAACAAGACAAAAAAA
AAAAAAAAAGAAAACAACAAGACAAAAAAA
AAAAAATAAGAAAACAACAAGACAAAAAAA
AAAAAATAAGAAAACAACAAGACAAAAAAA
AAAAAATTAGAAAACAACAAGACAAAAAAA

Example 7-2 was something of a programming challenge, but you end up with the 
satisfaction of seeing your (simulated) DNA mutate. How about writing a graphical 
display for this, so that every time a base gets mutated, it makes a little explosion and the 
color gets highlighted, so you can watch it happening in real-time? 


Before you scoff, you should know how important good graphical displays are for the 
success of most programs. This may be a trivial-sounding graphic, but if you can 
demonstrate the most common mutations in, for instance, the BRCA breast cancer genes 
in this way, it might be useful. 


7.3.4 A Bug in Your Program? 


To return to the business at hand, you may have noticed something when you looked over 
the output from Example 7-2. Look at the first two lines of the "10 more successive 
mutations." They are exactly the same! Could it be that after patting yourself on the back 
and telling yourself what a good bit of work you'd done, you've discovered a bug? 


How can you track it down? You may want to step through the running of the program 
with the Perl debugger, which you saw in Chapter 6. However, this time, you stop and 
think about your design instead. You're replacing the bases at random positions with 
randomly chosen bases. Aha! Sometimes the base at the position you randomly choose is 
exactly the same as the base you randomly choose to plug into its place! You're replacing 
a base with itself on occasion! 


(How often? In DNA that's all one base, it's happening 1/4 of the time. In DNA that's equally
populated with the four bases, it's happening...1/4 of the time!)
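
The 1/4 figure can be checked empirically. The following is a rough sketch, not part of Example 7-2: the helper fraction_unchanged and the trial count are assumptions made for illustration. It counts how often a randomly chosen base equals the base already sitting at a randomly chosen position:

```perl
use strict;
use warnings;

# Hypothetical helper: the fraction of trials in which a randomly chosen
# replacement base equals the base already at a random position.
sub fraction_unchanged {
    my($dna, $trials) = @_;
    my @nucleotides = ('A', 'C', 'G', 'T');
    my $unchanged = 0;

    for (my $i = 0 ; $i < $trials ; ++$i) {
        my $position = int rand length $dna;
        my $newbase  = $nucleotides[rand @nucleotides];
        ++$unchanged if $newbase eq substr($dna, $position, 1);
    }
    return $unchanged / $trials;
}

srand(time|$$);

# One sequence that is all one base, one with equal numbers of each base.
my $all_a = fraction_unchanged('A' x 30, 100_000);
my $mixed = fraction_unchanged('ACGT' x 10, 100_000);

printf "all A: %.3f  mixed: %.3f\n", $all_a, $mixed;
```

Both fractions should come out close to 0.25, although the exact digits vary from run to run.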


Let's say you decide that behavior is not useful. At each successive mutation, you need to 
see one base change. How can you alter your code to ensure that? Let's start with some 
pseudocode for the mutate subroutine: 

Select a random position in the string of DNA

Repeat:

    Choose a random nucleotide

Until: random nucleotide differs from the nucleotide in the random position

Substitute the random nucleotide into the random position in the DNA

This seems like something that should work, so you alter the mutate subroutine, calling 
it the mutate_better subroutine: 

# mutate_better
#
# Subroutine to perform a mutation in a string of DNA--version 2, in which
#  it is guaranteed that one base will change on each call
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub mutate_better {

    my($dna) = @_;
    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # Pick a random position in the DNA
    my($position) = randomposition($dna);

    # Pick a random nucleotide
    my($newbase);

    do {
        $newbase = randomnucleotide(@nucleotides);

    # Make sure it's different than the nucleotide we're mutating
    }until ( $newbase ne substr($dna, $position,1) );

    # Insert the random nucleotide into the random position in the DNA
    # The substr arguments mean the following:
    #  In the string $dna at position $position change 1 character to
    #  the string in $newbase
    substr($dna,$position,1,$newbase);

    return $dna;
}

When you plug this subroutine in place of mutate and run the code, you get the
following output:
Mutate DNA 


Here is the original DNA: 
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 

Here is the mutant DNA: 
AAAAAAAAAAAAATAAAAAAAAAAAAAAAA 

Here are 10 more successive mutations: 


AAAAAAAAAAAAATAAAAAAAACAAAAAAA 
AAAAATAAAAAAATAAAAAAAACAAAAAAA 
AAATATAAAAAAATAAAAAAAACAAAAAAA 
AAATATAAAAAAATAAAAAAAACAACAAAA 
AATTATAAAAAAATAAAAAAAACAACAAAA 
AATTATTAAAAAATAAAAAAAACAACAAAA 
AATTATTAAAAAATAAAAAAAACAACACAA 
AATTATTAAAAAGTAAAAAAAACAACACAA 
AATTATTAAAAAGTGAAAAAAACAACACAA 
AATTATTAAAAAGTGATAAAAACAACACAA 














which seems to indeed make a real change on every iteration. 


Notice one more thing about declaring variables. In this code for mutate_better, if
you'd declared $newbase within the loop, since the loop is enclosed in a block, the
variable $newbase would not then be visible outside of that loop. In particular, it
wouldn't be available in the substr call that does the actual base change for the
mutation. So, in mutate_better, you had to declare the variable outside of the loop.


This is a frequent source of confusion for programmers who like to declare variables on 
the fly and a powerful argument for getting into the habit of collecting variable 
definitions together at the top of the program. 
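
The scoping point can be seen in a stripped-down sketch of the same do-until pattern. The subroutine name different_base is a hypothetical one introduced here for illustration. Because $newbase is declared before the loop, it is visible in the until condition and in the return statement; declaring it with my inside the do block would hide it from both:

```perl
use strict;
use warnings;

# Hypothetical helper: choose a base that differs from $current.
sub different_base {
    my($current) = @_;
    my @nucleotides = ('A', 'C', 'G', 'T');

    # Declared before the loop so that it is visible both in the
    # until condition and after the loop ends.
    my $newbase;

    do {
        $newbase = $nucleotides[rand @nucleotides];
    } until ( $newbase ne $current );

    return $newbase;
}

srand(time|$$);
print different_base('A'), "\n";
```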


Even so, there are often times when you want to hide a variable within a block, because
that's the only place where you will use it. Then you may want to do the declaration in the
block. (Perhaps at the top of the block, if it's a long one?)


7.4 Generating Random DNA 


It's often useful to generate random data for test purposes. Random DNA can also be 
used to study the organization of actual DNA from an organism. In this section, we'll 
write some programs to generate random DNA sequences.


Such random DNA sequences have proved useful in several ways. For instance, the 
popular BLAST program (see Chapter 12) depends on the properties of random DNA 
for the analytic and empirical results that underpin the sequence similarity scores, 
statistics that are used to rank the "hits" that BLAST returns to the user. 


Let's assume what's needed is a set of random DNA fragments of varying length. Your 
program will have to specify a maximum and a minimum length, as well as how many 
fragments to generate. 


7.4.1 Bottom-up Versus Top-down 


In Example 7-2, you wrote the basic subroutines, then a subroutine that called the 
basic subroutines, and finally the main program. If you ignore the pseudocode, this is an 
example of bottom-up design; start with the building blocks, then assemble them into a 
larger structure. 


Now let's see what it's like to start with the main program, with its subroutine calls, and 
write the subroutines after you find a need for them. This is called top-down design. 


7.4.2 Subroutines for Generating a Set of Random DNA 


Given our goal of generating random DNA, perhaps what you want is a data-generating 
subroutine: 


@random_DNA = make_random_DNA_set( $minimum_length,
    $maximum_length, $size_of_set );

This looks okay, but of course, it begs the question of how to actually accomplish the 
overall task. (That's top-down design for you!) So you need to move down and write 
pseudocode for the make_random_DNA_set subroutine: 

repeat $size_of_set times:

    $length = random number between minimum and maximum length

    $dna = make_random_DNA ( $length );

    add $dna to @set
}

return @set

Now, continuing the top-down design, you need some pseudocode for the 
make_random_DNA subroutine: 

from 1 to $size

    $base = randomnucleotide

    $dna .= $base
}

return $dna

Don't go any further: you've already got a randomnucleotide subroutine from 
Example 7-2. 


(Are you bothered by the absence of balanced curly braces in the pseudocode? Here, 
you're relying on indentation and lining up the right braces to indicate the blocks. Since 
it's pseudocode, anything is allowed as long as it works.) 


7.4.3 Turning the Design into Code 


Now that we've got a top-down design, how to proceed with the coding? Let's follow the 
top-down design, just to see how it works. 


Example 7-3 starts with the main program and then proceeds to the subroutines,
following the order of the top-down design you did in pseudocode.


Example 7-3. Generate random DNA 


#!/usr/bin/perl
# Generate random DNA
#  using a random number generator to randomly select bases

use strict;
use warnings;

# Declare and initialize the variables
my $size_of_set = 12;
my $maximum_length = 30;
my $minimum_length = 15;

# An array, initialized to the empty list, to store the DNA in
my @random_DNA = (  );

# Seed the random number generator.
# time|$$ combines the current time with the current process id
srand(time|$$);

# And here's the subroutine call to do the real work
@random_DNA = make_random_DNA_set( $minimum_length, $maximum_length, $size_of_set );

# Print the results, one per line
print "Here is an array of $size_of_set randomly generated DNA sequences\n";
print "  with lengths between $minimum_length and $maximum_length:\n\n";

foreach my $dna (@random_DNA) {

    print "$dna\n";
}

print "\n";

exit;


################################################################################
# Subroutines
################################################################################

# make_random_DNA_set
#
# Make a set of random DNA
#
#   Accept parameters setting the maximum and minimum length of
#     each string of DNA, and the number of DNA strings to make
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub make_random_DNA_set {

    # Collect arguments, declare variables
    my($minimum_length, $maximum_length, $size_of_set) = @_;

    # length of each DNA fragment
    my $length;

    # DNA fragment
    my $dna;

    # set of DNA fragments
    my @set;

    # Create set of random DNA
    for (my $i = 0; $i < $size_of_set ; ++$i) {

        # find a random length between min and max
        $length = randomlength ($minimum_length, $maximum_length);

        # make a random DNA fragment
        $dna = make_random_DNA ( $length );

        # add $dna fragment to @set
        push( @set, $dna );
    }

    return @set;
}

# Notice that we've just discovered a new subroutine that's
# needed: randomlength, which will return a random
# number between (or including) the min and max values.
# Let's write that first, then do make_random_DNA


# randomlength
#
# A subroutine that will pick a random number from
# $minlength to $maxlength, inclusive.
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomlength {

    # Collect arguments, declare variables
    my($minlength, $maxlength) = @_;

    # Calculate and return a random number within the
    #  desired interval.
    # Notice how we need to add one to make the endpoints inclusive,
    #  and how we first subtract, then add back, $minlength to
    #  get the random number in the correct interval.
    return ( int(rand($maxlength - $minlength + 1)) + $minlength );
}


# make_random_DNA
#
# Make a string of random DNA of specified length.
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub make_random_DNA {

    # Collect arguments, declare variables
    my($length) = @_;

    my $dna;

    for (my $i=0 ; $i < $length ; ++$i) {

        $dna .= randomnucleotide( );
    }

    return $dna;
}

# We also need to include the previous subroutine
# randomnucleotide.
# Here it is again for completeness.

# randomnucleotide
#
# Select at random one of the four nucleotides
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomnucleotide {

    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # scalar returns the size of an array.
    # The elements of the array are numbered 0 to size-1
    return randomelement(@nucleotides);
}

# randomelement
#
# randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomelement {

    my(@array) = @_;

    return $array[rand @array];
}

Here's the output from Example 7-3: 
Here is an array of 12 randomly generated DNA sequences 
with lengths between 15 and 30: 


TACGCTTGTGTTTTCGGGGGAC 
GGGGTGTGGTAAGGCTGTCTCAGATGTGC 
TGAACGACAACCTCCTGGACTTTACT 
ATCTATGCTTTGCCATGCTAGT 
CCCCTCATTCCTCTTCCTCGGC 
TGTACCCCTAATACACTTTAGCCGAATTTA 
ATAGGTCGGGGCGACAGCGCCGG 
GATTGACCTCTGTAA 
AAAATCTCTAGGATCGAGC 
GTATGTGCTTGGGTAAAT 
ATGGAGTTGCGAGGAAGTAGCTGAGT 
GGCCCATGACCAGCATCCAGACAGCA 
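
One way to convince yourself that the "add one" arithmetic in randomlength really makes both endpoints reachable is to call it many times and tally the results. This is only a sketch: the %seen tally and the choice of 10,000 calls are assumptions made for illustration, and randomlength is copied from Example 7-3:

```perl
use strict;
use warnings;

# Same arithmetic as randomlength in Example 7-3.
sub randomlength {
    my($minlength, $maxlength) = @_;
    return ( int(rand($maxlength - $minlength + 1)) + $minlength );
}

srand(time|$$);

# Tally how often each length comes up over many calls.
my %seen;
for (my $i = 0 ; $i < 10_000 ; ++$i) {
    ++$seen{ randomlength(15, 30) };
}

my @lengths = sort { $a <=> $b } keys %seen;
print "smallest: $lengths[0]  largest: $lengths[-1]  distinct: ",
      scalar(@lengths), "\n";
```

With this many calls you should see every length from 15 through 30. Without the "+ 1", the largest length, 30, would never appear.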





7.5 Analyzing DNA 


In this final example dealing with randomization, you'll collect some statistics on DNA in 
order to answer the question: on average, what percentage of bases are the same between 
two random DNA sequences? Although some simple mathematics can answer the 
question for you, the point of the program is to show that you now have the necessary 
programming ability to ask and answer questions about your DNA sequences. (If you 
were using real DNA, say a collection of some particular gene as it appears in several 
organisms in slightly different forms, the answer would be somewhat more interesting. 
You may want to try that later.) 


So let's generate a set of random DNA, all the same length, then ask the following 
question about the set. What's the average percentage of positions that are the same 
between pairs of DNA sequences in this set? 


As usual, let's try to sketch an idea of the program in pseudocode: 


Generate a set of random DNA sequences, all the same length

For each pair of DNA sequences

    How many positions in the two sequences are identical as a fraction?

}

Report the mean of the preceding calculations as a percentage


Clearly, to write this code, you can reuse at least some of the work you've already done. 
You certainly know how to generate a set of random DNA sequences. Also, although you 
don't have a subroutine that compares, position by position, the bases in two sequences, 
you know how to look at the positions in DNA strings. So that subroutine shouldn't be 
hard to write. In fact, let's write some pseudocode that compares each nucleotide in one 
sequence with the nucleotide in the same position in another sequence: 


assuming DNA1 is the same length as DNA2,

for each position from 1 to length(DNA)

    if the character at that position is the same in DNA_1 and DNA_2

        ++$count
    }

return count/length

The whole problem now seems eminently do-able. You also have to write the code that
picks each pair of sequences, collects the results, and finally takes the mean of the results
and reports it as a percentage. That can all go into the main program. Example 7-4
gives it a try, all in one shot.


Example 7-4. Calculate average % identity between pairs of random DNA 
sequences 


#!/usr/bin/perl
# Calculate the average percentage of positions that are the same
# between two random DNA sequences, in a set of 10 sequences.

use strict;
use warnings;

# Declare and initialize the variables
my $percent;
my @percentages;
my $result;

# An array, initialized to the empty list, to store the DNA in
my @random_DNA = (  );

# Seed the random number generator.
# time|$$ combines the current time with the current process id
srand(time|$$);

# Generate the data set of 10 DNA sequences.
@random_DNA = make_random_DNA_set( 10, 10, 10 );

# Iterate through all pairs of sequences
for (my $k = 0 ; $k < scalar @random_DNA - 1 ; ++$k) {
    for (my $i = ($k + 1) ; $i < scalar @random_DNA ; ++$i) {

        # Calculate and save the matching percentage
        $percent = matching_percentage($random_DNA[$k], $random_DNA[$i]);
        push(@percentages, $percent);
    }
}

# Finally, the average result:
$result = 0;

foreach $percent (@percentages) {
    $result += $percent;
}

$result = $result / scalar(@percentages);

#Turn result into a true percentage
$result = int ($result * 100);

print "In this run of the experiment, the average percentage of \n";
print "matching positions is $result%\n\n";

exit;


################################################################################
# Subroutines
################################################################################

# matching_percentage
#
# Subroutine to calculate the percentage of identical bases in two
# equal length DNA sequences

sub matching_percentage {

    my($string1, $string2) = @_;

    # we assume that the strings have the same length
    my($length) = length($string1);
    my($position);
    my($count) = 0;

    for ($position=0; $position < $length ; ++$position) {
        if(substr($string1,$position,1) eq substr($string2,$position,1)) {
            ++$count;
        }
    }

    return $count / $length;
}

# make_random_DNA_set
#
# Subroutine to make a set of random DNA
#
#   Accept parameters setting the maximum and minimum length of
#     each string of DNA, and the number of DNA strings to make
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub make_random_DNA_set {

    # Collect arguments, declare variables
    my($minimum_length, $maximum_length, $size_of_set) = @_;

    # length of each DNA fragment
    my $length;

    # DNA fragment
    my $dna;

    # set of DNA fragments
    my @set;

    # Create set of random DNA
    for (my $i = 0; $i < $size_of_set ; ++$i) {

        # find a random length between min and max
        $length = randomlength ($minimum_length, $maximum_length);

        # make a random DNA fragment
        $dna = make_random_DNA ( $length );

        # add $dna fragment to @set
        push( @set, $dna );
    }

    return @set;
}

# randomlength
#
# A subroutine that will pick a random number from
# $minlength to $maxlength, inclusive.
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomlength {

    # Collect arguments, declare variables
    my($minlength, $maxlength) = @_;

    # Calculate and return a random number within the
    #  desired interval.
    # Notice how we need to add one to make the endpoints inclusive,
    #  and how we first subtract, then add back, $minlength to
    #  get the random number in the correct interval.
    return ( int(rand($maxlength - $minlength + 1)) + $minlength );
}

# make_random_DNA
#
# Make a string of random DNA of specified length.
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub make_random_DNA {

    # Collect arguments, declare variables
    my($length) = @_;

    my $dna;

    for (my $i=0 ; $i < $length ; ++$i) {
        $dna .= randomnucleotide( );
    }

    return $dna;
}

# randomnucleotide
#
# Select at random one of the four nucleotides
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomnucleotide {

    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # scalar returns the size of an array.
    # The elements of the array are numbered 0 to size-1
    return randomelement(@nucleotides);
}

# randomelement
#
# randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomelement {

    my(@array) = @_;

    return $array[rand @array];
}


If the code in Example 7-4 seems somewhat repetitive of code from previous 
examples, it is. In the interest of presentation, I included the subroutine code in the 
program. (You'll start using modules in Chapter 8 as a way to avoid this repetition.) 


Here's the output of Example 7-4: 


In this run of the experiment, the average percentage of
matching positions is 24%





Well, that seems reasonable. You might say, it's obvious: a quarter of the positions match, 
and there are four bases. But the point isn't to verify elementary probability, it's to show 
you have enough programming under your belt to write some programs that ask and 
answer questions about DNA sequences. 
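
For the record, the elementary probability can be sketched as follows, under the assumption that every position is drawn independently and each of the four bases is equally likely (which is how randomnucleotide chooses them):

```latex
P(\text{match at one position})
  = \sum_{b \in \{A, C, G, T\}} P(b)^2
  = 4 \cdot \left(\tfrac{1}{4}\right)^2
  = \tfrac{1}{4} = 25\%
```

A single run lands near this value rather than exactly on it, because only 45 pairs of 10-base sequences are compared and because int discards the fractional part of the percentage.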


7.5.1 Some Notes About the Code 


Notice in the main program that when it calls: 


@random_DNA = make_random_DNA_set( 10, 10, 10 );

you don't need to declare and initialize variables such as $minimum_length. You can
just fill in the actual numbers when you call the subroutine. (However it's often a good 
idea to put such things in variables declared at the top of the program, where it's easy to 
find and change them.) Here, you set the maximum and minimum lengths to 10 and ask 
for 10 sequences. 


Let's restate the problem we just solved. You have to compare all pairs of DNA, and for 
each pair, calculate the percentage of positions that have the same nucleotides. Then, you 
have to take the mean of these percentages. 


Here's the code that accomplishes this in the main program of Example 7-4: 
# Iterate through all pairs of sequences
for (my $k = 0 ; $k < scalar @random_DNA - 1 ; ++$k) {
    for (my $i = ($k + 1) ; $i < scalar @random_DNA ; ++$i) {

        # Calculate and save the matching percentage
        $percent = matching_percentage($random_DNA[$k], $random_DNA[$i]);
        push(@percentages, $percent);
    }
}


To look at each pair, you use a nested loop. A nested loop is simply a loop within another
loop. These are fairly common in programming but must be handled with care. They may
seem a little complex; take some time to see how the nested loop works, because it's
common to have to select all combinations of two (or more) elements from a set.


The nested loop looks at (n * (n-1)) / 2 pairs of sequences, which is a quadratic
function of the size of the data set. This can get very big! For example, 10 sequences give
45 pairs, but 1,000 sequences give 499,500. Try gradually increasing the size of the data
set and rerunning the program, and you'll see your compute time increase, and more than
gradually.


See how the looping works? First sequence 0 (indexed by $k) is paired with sequences
1,2,3,...,9, in turn (indexed by $i). Then sequence 1 is paired with 2,3,...,9, etc. Finally, 8
is paired with 9. (Recall that array elements are numbered starting at 0, so the last
element of an array with 10 elements is numbered 9. Also recall that scalar
@random_DNA returns the number of elements in the array.)


You might find it a worthwhile exercise to let the number of sequences be some small 
value, say 3 or 4, and think through (paper and pencil in hand) how the nested loops and 
the variables $k and $i evolve during the running of the program. Or you can use the 
Perl debugger to watch how it happens. 
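

If you'd rather let the computer do the paper-and-pencil work, here is a minimal sketch
of the same nested loop that records which index pairs it visits. The subroutine name
index_pairs is my own invention for this illustration; it isn't part of Example 7-4.

```perl
use strict;
use warnings;

# index_pairs (hypothetical helper, not from Example 7-4)
#
# Given the number of elements $n, return every pair of indices "k-i"
# with k < i, in the order the nested loop visits them.
sub index_pairs {
    my ($n) = @_;
    my @pairs;

    for (my $k = 0 ; $k < $n - 1 ; ++$k) {
        for (my $i = $k + 1 ; $i < $n ; ++$i) {
            push(@pairs, "$k-$i");
        }
    }
    return @pairs;
}

my @pairs = index_pairs(4);

print "@pairs\n";              # 0-1 0-2 0-3 1-2 1-3 2-3
print scalar(@pairs), "\n";    # 6, which is (4 * 3) / 2
```

With four sequences there are six pairs, matching the (n * (n-1)) / 2 formula. Try
changing the argument to 10 and you should see 45 pairs.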


7.6 Exercises 


Exercise 7.1 


Write a program that asks you to pick an amino acid and then keeps (randomly) 
guessing which amino acid you picked. 


Exercise 7.2 


Write a program that picks one of the four nucleotides and then keeps prompting 
until you correctly guess the nucleotide it picked. 


Exercise 7.3 


Write a subroutine to randomly shuffle the elements of an array. The subroutine 
should take an array as an argument and return an array with the same elements 
but shuffled in a random order. Each element of the original array should appear 
exactly once in the output array, just like shuffling a deck of cards. 


Exercise 7.4 


Write a program to mutate a protein sequence, similar to the code in Example 7-2
that mutates DNA.


Exercise 7.5 


Write a subroutine that, given a codon (a fragment of DNA of length 3), returns a 
random mutation in the codon. 


Exercise 7.6 


Some versions of Perl automatically seed the random number generator, making it
superfluous to call srand for that purpose before using rand to generate
random numbers. Experiment to see if your implementation of rand calls
srand automatically, or if you have to explicitly call srand yourself, as you
have seen done in the code in this chapter.


Exercise 7.7 


Sometimes not all choices are equally likely to be picked in a random selection. Write a
subroutine that randomly returns a nucleotide, in which the probability of each 
nucleotide can be specified. Pass the subroutine four numbers as arguments, 
representing the probabilities of each nucleotide; if each probability is 0.25, the 
subroutine is equally likely to pick each nucleotide. As error checking, have the 
subroutine ensure that the sum of the four probabilities is 1. 

Hint: one way to accomplish this is to divide the range between 0 and 1 into four 
intervals with lengths corresponding to the probability of the respective 
nucleotides. Then, simply pick a random number between 0 and 1, see in which 
interval it falls, and return the corresponding nucleotide. 


Exercise 7.8 


This is a more difficult exercise. The study function in Perl may speed 
up searches for motifs in DNA or protein. Read the Perl documentation on this 
function. Its use is simple: given some sequence data in a variable $sequence,
type:

study $sequence;

before doing the searches. Do you think study will speed up searches in DNA or 
protein, based on what you've read about it in the documentation? 


For lots of extra credit: now read the Perl documentation on the standard module
Benchmark. (Type perldoc Benchmark, or visit the Perl home page at
http://www.perl.com.) See if your guess is right by writing a program that
benchmarks motif searches of DNA and of protein, with and without study.


Chapter 8. The Genetic Code


Up to this point we've used Perl to search for motifs, simulate DNA mutations, generate 
random sequences, and transcribe DNA to RNA. These are all important activities, and 
they serve as a good introduction to the computational techniques you can use to study 
biological systems. 


In this chapter, we'll write Perl programs to simulate how the genetic code directs the 
translation of DNA into protein. I will start by introducing the hash datatype. Then, after 
a brief discussion of how different data structures (hashes, arrays, and databases) can 
store and access experimental information, we will write a program to translate DNA to 
protein. We'll also continue exploring regular expressions and write code to handle 
FASTA files. 


8.1 Hashes 


There are three main datatypes in Perl. You've already seen two: scalar variables and 
arrays. Now we'll start to use the third: hashes (also called associative arrays). 


A hash provides very fast lookup of the value associated with a key. As an example, say
you have a hash called %english_dictionary. (Yes, hashes start with the percent
sign.) If you want to look up the definition of the word "recreant," you say:


$definition = $english_dictionary{'recreant'};

The scalar 'recreant' is the key, and the scalar definition that's returned is the value. 
As you see from this example, hashes (like arrays) change their leading character to a 
dollar sign when you access a single element, because the value returned from a hash 
lookup is a scalar value. You can tell a hash lookup from an array element by the type of 
braces they use: arrays use square brackets [ ]; hashes use curly braces { }. 


If you want to assign a value to a key, it's similarly an easy, single statement: 


$english_dictionary{'recreant'} = "One who calls out in surrender.";


Also, if you want to initialize a hash with some key-value pairs, it's done much like
initializing arrays, but each pair of elements becomes a key and its value:


%classification = (
    'dog',   'mammal',
    'robin', 'bird',
    'asp',   'reptile',
);

which initializes the key 'dog' with the value 'mammal', and so on. There's another
way of writing this that shows the key-value relationship more clearly. The following
does exactly the same thing as the preceding code:

%classification = (
    'dog'   => 'mammal',
    'robin' => 'bird',
    'asp'   => 'reptile',
);

You can get an array of all the keys of a hash:

@keys = keys %my_hash;

You can get an array of all the values of a hash:

@values = values %my_hash;


You use hashes in lots of different situations, especially when your data is in the form of 
key-value or you need to look up the value of a key fast. For instance, later in this chapter, 
we'll develop programs that use hashes to retrieve information about a gene. The gene 
name is the key; the information about the gene is the value of that key. Mathematically, 
a Perl hash always represents a finite function. 
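

To pull these hash operations together, here is a short sketch that initializes a hash,
looks up a value, adds a pair, and lists the contents. It reuses the %classification
data from above; the 'trout' entry is something I've added purely for illustration.

```perl
use strict;
use warnings;

my %classification = (
    'dog'   => 'mammal',
    'robin' => 'bird',
    'asp'   => 'reptile',
);

# Look up a single value: note the dollar sign and the curly braces
my $class = $classification{'robin'};
print "A robin is a $class\n";

# Assign a value to a new key ('trout' is an invented entry)
$classification{'trout'} = 'fish';

# keys returns the keys in no particular order, so sort them for display
foreach my $animal (sort keys %classification) {
    print "$animal => $classification{$animal}\n";
}

# The number of key-value pairs in the hash
my $count = scalar(keys %classification);
print "The hash holds $count pairs\n";
```

Each key appears only once in a hash, so assigning to an existing key replaces its old
value rather than adding a second entry.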


The name "hash" comes from something called a hash function, which practically any 
book on algorithms will define, if you've a mind to look it up. Let's skip the details of 
how they work under the hood and just talk about their behavior. 


8.2 Data Structures and Algorithms for Biology 


Biologists explore biological data and try to figure out how to do things with it based on 
its existing structure in living systems. Bioinformatics is often used to model that existing 
structure as closely as possible. (Bear with me; I'm speaking in generalities!) 


Bioinformatics also can take a slightly different approach. It thinks about what it wants to 
do with the data and then tries to figure out how to organize it to accomplish that goal. In 
other words, it tries to produce an algorithm by representing the data in a convenient data 
structure. 


Now that you've got the three datatypes of Perl in hand—namely scalars, arrays, and 
hashes—it's time to take a look at these interrelated topics of algorithms and data 
structures. We've already talked about algorithms in Chapter 3. The present discussion 
highlights the importance of the organization of the data for algorithms, in other words, 
the data structures for the algorithm. 


The most important point here is that different algorithms often require different data 
structures. 


8.2.1 A Gene Expression Database


Let's consider a typical problem. Say you're studying an organism that has a total of about
30,000 genes. (Yep, you're right, it's human.) Say you're looking at a type of cell that's 
never been well characterized under certain interesting environmental conditions, and 
you are determining, for each gene, whether it's being expressed.[1] You have a nice
microarray facility that has given you the expression information for that cell. Now, for 
each gene, you need to look up whether it's expressed in the cell. You have to put this 
look-up capability on your web site, so visitors who read your results in your upcoming 
paper can find the expression data for the genes. 


[1] For the nonbiologists: a gene is expressed when it is transcribed into RNA, so that a protein
can be made from it. 


There are several ways to proceed. Let's look at a few alternatives as a short and gentle 
introduction to the art and science of algorithms and data structures. 


What is your data? For simplicity, let's say you have the names for all the genes in the 
organism and a number for the expressed genes indicating the level of the expression in 
your experiment; the unexpressed genes have the number 0. 


8.2.2 Gene Expression Data Using Unsorted Arrays 


Now let's suppose you want to know if the genes were expressed, but not the expression 
levels, and you want to solve this programming problem using arrays. After all, you are 
somewhat familiar with arrays by this point. How do you proceed? 


You might store in the array only the names of the genes that are being expressed and 
discard the other gene names. Say there were 8,000 expressed genes. Then, for any query, 
the answer requires looking through the array and comparing the query with each gene in 
the array until either you find it or get to the end of the array without finding it. 


That works, but there are problems. Mainly, it's kind of slow. This isn't bad if you just do 
it now and then, but if you've got a lot of people hitting your web site asking questions 
about this new expression data, it can be a problem. On average, a lookup for an 
expressed gene requires looking through 4,000 gene names. A lookup for an unexpressed 
gene takes 8,000 comparisons. 


Also, if someone asked about a gene missing from your study, you couldn't respond, 
since you discarded the unexpressed gene names. The query gives a negative response, 
not an error message saying the gene being searched for isn't part of your experiment. 
This might even be a false negative if the query gene that wasn't part of your study 
actually is expressed in the cell type (but you just missed it). You'd prefer it if your 
program would report to the user that no gene by that name was studied. 


So you decide to keep all 30,000 genes in the array. (Of course, now a search will be
slower.) But how to distinguish the expressed from the unexpressed genes? You can load
each gene's name into the array and then append the expression measurement after the
name of each gene. Then you will definitely know if a gene is missing from your
experiment.


However, the program is still a bit slow. You still have to search through the entire array
until you find the gene or determine that it wasn't studied. You may find it right away if 
it's the first element in the array, or you may have to wait until the last element. On 
average, you have to search through half of the array. Plus, you have to compare the 
name of the searched-for gene with the names of the genes in the array one by one. It will 
average 15,000 comparisons per query: slow. (Actually, on a modern computer, not too 
horribly slow, really, but I'm making a point. These sorts of things do add up with a 
program that runs too slowly.) 


Another problem is that you're now keeping two values in one scalar: the gene name and 
the expression measurement. To do anything with this data, you have to also separate the 
gene name from the measurement of the expression of the gene. 


Despite these drawbacks, this method will work. Now, let's think about alternatives. 
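

To make the unsorted-array approach concrete, here is a rough sketch of what it might
look like. The gene names, the expression levels, and the subroutine name
lookup_expression are all made up for this illustration. Notice how each element has to
be split apart to separate the name from the measurement.

```perl
use strict;
use warnings;

# Invented data: each element holds a gene name and its expression level
my @genes = ('geneA 0', 'geneB 12.5', 'geneC 3.1');

# lookup_expression (hypothetical helper)
#
# Search the list one element at a time. Return the expression level
# for $query, or undef if no gene by that name was studied.
sub lookup_expression {
    my ($query, @list) = @_;

    foreach my $entry (@list) {
        my ($name, $level) = split ' ', $entry;
        return $level if $name eq $query;
    }
    return;
}

my $level   = lookup_expression('geneB', @genes);
my $missing = lookup_expression('geneZ', @genes);

print "geneB: $level\n";                                    # geneB: 12.5
print "geneZ was not studied\n" unless defined $missing;
```

In the worst case the loop looks at every element, which is exactly the slowness
discussed above.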


8.2.3 Gene Expression Data Using Sorted Arrays and Binary Search


You might try storing all the gene names in alphabetical order in the array and then use 
the following search technique. First, look at the middle element. (You can tell the size of 
the array, as we've seen, with the expression scalar @array). If your gene name is 
before that middle element alphabetically, you ignore the second half of the array and 
pick the middle element of the remaining half of the array. You continue, at each step 
narrowing the search to half the previous number of elements, until you find a match or 
discover there is none. Here it is in pseudocode: 

Given a sorted array, and an element:

Until you find the element or discover it's not there {
    Pick the midpoint of the array, $array[scalar(@array)/2]
    Compare your element with the element at the midpoint
    If that matches your element, you're done.
    Else, ignore the half of the array that your element is not in
}

To compare two strings alphabetically in Perl, you use the cmp operator, which returns 0
if the two strings are the same, -1 if they are in alphabetical order, and 1 if they are in
reverse alphabetical order. For example, the following returns 0:

'ZZZ' cmp 'ZZZ';


This returns -1:

'AAA' cmp 'ZZZ';


Finally, this returns 1:

'ZZZ' cmp 'AAA';


This algorithm is called a binary search, and it considerably speeds up the process of
searching in an array. For example, searching 30,000 genes takes at most about 15 times
through the loop, because halving 30,000 about 15 times gets you down to a single
element. (Compare that to an average of 15,000 comparisons for the unsorted array.) Of
course, you also have to sort the list, which might take a while. If you need to keep
adding elements, you have to either insert them in the right place or add them to the end
and sort the array again. All that inserting or sorting might slow things down
considerably. But if you're just sorting it once and then doing lots of lookups, a binary
search might be worth doing.
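

Here is one way the pseudocode above might be turned into Perl. Treat it as a sketch
rather than the definitive version: the subroutine name binary_search and the gene names
are invented, and I pass the array by reference so it isn't copied on every call.

```perl
use strict;
use warnings;

# binary_search (hypothetical helper)
#
# Given a reference to an alphabetically sorted array and a target string,
# return the index of the target, or -1 if it is not in the array.
sub binary_search {
    my ($sorted, $target) = @_;
    my ($low, $high) = (0, $#{$sorted});

    while ($low <= $high) {
        my $mid   = int(($low + $high) / 2);
        my $order = $target cmp $sorted->[$mid];

        if    ($order == 0) { return $mid }          # found it
        elsif ($order <  0) { $high = $mid - 1 }     # look in the first half
        else                { $low  = $mid + 1 }     # look in the second half
    }
    return -1;
}

my @genes = sort qw(xyz1 abc2 mno3 def4 pqr5);    # invented gene names

print binary_search(\@genes, 'mno3'), "\n";    # 2
print binary_search(\@genes, 'nope'), "\n";    # -1
```

Instead of physically discarding half of the array at each step, this version just moves
the $low and $high markers, which amounts to the same thing and avoids copying data.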


While we're at it, let's look at how to sort an array. Here's how to sort an array of strings 
alphabetically: 


@array = sort @array;

Here's how to sort an array of numbers in ascending order:


@array = sort { $a <=> $b } @array;

Many other kinds of sorting can be done, but these are the most common. For more
details, see the Perl documentation for the sort function.
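

The difference between the two kinds of sorting is easy to get wrong, so here is a small
sketch that shows both on the same data. The numbers are invented expression levels.

```perl
use strict;
use warnings;

my @levels = (10, 9, 100, 2.5);    # invented expression levels

my @as_strings = sort @levels;                  # compares the values as text
my @as_numbers = sort { $a <=> $b } @levels;    # compares the values as numbers
my @descending = sort { $b <=> $a } @levels;    # numeric, largest first

print "@as_strings\n";    # 10 100 2.5 9
print "@as_numbers\n";    # 2.5 9 10 100
print "@descending\n";    # 100 10 9 2.5
```

The default sort puts 100 before 9 because it compares character by character, and "1"
comes before "9". When your data is numeric, remember to supply the <=> comparison.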


8.2.4 Gene Expression Data Using Hashes 


You can also use hashes to find a gene in your data. To do so, you can load the hash so 
that the keys are the gene names and the values are the expression measurements. Then a
single call on the hash, with the name of the desired gene as a key, returns the results of 
the experiment for that gene, and you've got your answer. This process is also cleaner 
than storing the gene name and the expression result in one scalar string; here the key is a 
scalar, and the value is a separate scalar. 


Furthermore, due to how hashes are made, you get an answer back very quickly, because 
decent hashes don't have to search hard to find the value of a key. Using hashes is 
typically faster than binary searches. Plus, you'd know if the gene being searched for was 
in the data, because you can explicitly ask if a hash value is defined by saying something 
like: 


if( defined $myhash{'mykey'} ) { ... }


Also, you'll get a warning message if you have warnings turned on and you use an
undefined value.
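

Here is a sketch of how the hash version of the expression lookup might go. The gene
names and levels are invented, as is the subroutine name describe_gene. I use Perl's
exists function here, which asks whether a key is in the hash at all; defined asks
whether the value stored under that key is defined. For this data either would work,
but exists states the question "was this gene studied?" most directly.

```perl
use strict;
use warnings;

# Invented gene names and expression levels, for illustration only
my %expression = (
    'geneA' => 0,       # studied, but not expressed
    'geneB' => 12.5,    # studied and expressed
);

# describe_gene (hypothetical helper)
#
# Report whether a gene was studied and, if so, whether it is expressed.
sub describe_gene {
    my ($gene) = @_;

    if    ( !exists $expression{$gene} ) { return "$gene was not studied" }
    elsif ( $expression{$gene} > 0 )     { return "$gene is expressed" }
    else                                 { return "$gene is not expressed" }
}

foreach my $gene (qw(geneA geneB geneC)) {
    print describe_gene($gene), "\n";
}
```

This prints that geneA is not expressed, geneB is expressed, and geneC was not studied,
which is the three-way answer the unsorted array of expressed genes couldn't give.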


Another advantage of hashes over binary searching is that you can add elements to a
hash, or remove them, without re-sorting anything.


Finally, because hashes are built into Perl as a basic datatype, they are easy to use, and
you won't have to do much programming to accomplish your goal. It is usually the case
that it's more important to save time writing a program than it is to save time running it. I
mention this in Chapter 3, but it's worth emphasizing. To a programmer, the lazy way
is often the most efficient way: let the machine do the work!


Don't get the idea that hashes are always the right way to go, however. For instance, they 
don't store their elements in a sorted order, so if you need to look at the data that way, 
you have to explicitly sort it, like so: 


@sorted_keys = sort keys %my_hash;


This is doable, but it can be a bit slow on a large hash. (You could also sort the values,
of course.)


To conclude the discussion of data structures for our expression data example, here's an 
informal survey of the properties of some different data structures in Perl for searching, 
adding and deleting, and maintaining sorted order in a set of gene names: 


Use a hash if you just need to see if something is in a set and don't need to list the set in 
order. 


A sorted array combined with a binary search algorithm will do if you need an ordered 
set and pretty fast lookup and don't need to add or subtract elements very often. 


An array, in conjunction with the Perl functions push and pop, works well if you don't 
need to sort the elements but do need to quickly get at the most recently added element. 


A Perl array with the functions push and shift will serve if you don't need the elements 
sorted but need to add elements. It's especially useful to always remove the "oldest" 
element (the element that has been in the array the longest). 
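

The last two items in that survey are easier to see in code. Here's a brief sketch, with
invented gene names, of an array used both ways: push with pop gets you the newest
element, and push with shift gets you the oldest.

```perl
use strict;
use warnings;

# push and pop: the most recently added element comes off first
my @stack;
push(@stack, 'geneA', 'geneB', 'geneC');    # invented gene names
my $newest = pop(@stack);                   # 'geneC'

# push and shift: the oldest element comes off first
my @queue;
push(@queue, 'geneA', 'geneB', 'geneC');
my $oldest = shift(@queue);                 # 'geneA'

print "newest: $newest, left on the stack: @stack\n";
print "oldest: $oldest, left in the queue: @queue\n";
```

Computer scientists call the first arrangement a stack and the second a queue; in Perl
both are just ordinary arrays used with different functions.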


For more information, see Appendix A and especially Mastering Algorithms with Perl 
(published by O'Reilly). 


8.2.5 Relational Databases 


Databases are programs that store and retrieve large amounts of data. They provide the 
most common forms of datatypes to use in algorithms. There are several popular
databases. Some good ones are free of charge (the best ones are very expensive), and
Perl provides access to all the most popular ones. The Perl DBI modules, for instance,
provide convenient access to relational databases from Perl programs.


Most databases are called relational, which describes how they store data. Another
common name for these types of databases is relational database management systems, or
RDBMS.


Relational databases store data organized in tables. The data is usually entered and
extracted with a query language called Structured Query Language, or SQL, which is a
fairly simple language for accessing the data in the tables and following links between the
tables.


Relational databases are the most popular way to store and retrieve large amounts of data, 
but they do require a fair bit of learning. Programming with relational databases is 
beyond the scope of this book, but if you end up doing a lot of programming with Perl, 
you'll find that knowing the basics of using a database is a valuable skill. See the 
discussion in Chapter 13. 


In particular, it's perfectly reasonable to store your gene expression data in a relational 
database and use that in your program to respond to queries made on your web site. 


8.2.6 DBM 


Perl has a simple, built-in way to store hash data, called database management (DBM). 
It's simple to use: after starting up, it "ties" a hash to a file on your computer disk, so you 
can save a hash to reuse at a later date. This is, in effect, a simple (and very useful) 
database. Apart from the initialization, you use it as you would any other hash. You can 
store your genes and expression data in a DBM file and then use it as a hash. There's
more on DBM in Chapter 10.
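

To give a flavor of what tying a hash to a file looks like, here is a minimal sketch. It
assumes the SDBM_File module, one of the DBM implementations bundled with Perl; your
installation may offer others, and the code shown later in the book may do this
differently. The file name and the gene data are invented, and I put the file in a
temporary directory so the example cleans up after itself.

```perl
use strict;
use warnings;
use Fcntl;                     # supplies the O_RDWR and O_CREAT flags
use SDBM_File;                 # assumption: this DBM flavor is installed
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/expression";    # invented file name

# Tie the hash to the file, creating the file if necessary
my %expression;
tie(%expression, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666)
    or die "Cannot open DBM file $file: $!";

$expression{'geneA'} = 12.5;     # invented data; it is written to the file
untie(%expression);

# Later (even in a different program), tie a hash to the same file
my %saved;
tie(%saved, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666)
    or die "Cannot open DBM file $file: $!";

my $value = $saved{'geneA'};
print "geneA: $value\n";         # geneA: 12.5
untie(%saved);
```

Between the tie and the untie, the hash behaves like any other; the only difference is
that its contents live on disk.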


8.3 The Genetic Code 


The genetic code is how a cell translates the information contained in its DNA into amino 
acids and then proteins, which do the real work in the cell. 


8.3.1 Background 


Herein is a short introduction for the nonbiologists. 


As stated earlier, DNA encodes the primary structure (i.e., the amino acid sequence) of 
proteins. DNA has four nucleotides, and proteins have 20 amino acids. The encoding 
works by taking each group of three nucleotides from the DNA and "translating" them to 
an amino acid or a stop signal. Each group of three nucleotides is called a codon. We'll 
see in detail how this coding and translation works. 


Actually, transcription first uses DNA to make RNA, and then translation uses RNA to 
make proteins. This is called the central dogma of molecular biology. But in this course, 
I'll abbreviate the process and somewhat inaccurately call the entire process from DNA to 
protein "translation." 


The reason for this cavalier distinction is that the whole business is much easier to
simulate on computer using strings to represent the DNA, RNA, and proteins. In fact, as
shown in Chapter 4, transcribing DNA to RNA is very easy indeed. In your computer
simulations, you can simply skip that step, since it's just a matter of changing one letter to
another. (The actual process in the cell, of course, is much more complex.)


Note that with four kinds of bases, each group of three bases of DNA can represent as
many as 4 x 4 x 4 = 64 possible amino acids. Since there are only 20 amino acids plus a 
stop signal, the genetic code has evolved some redundancy, so that some amino acids are 
represented by more than one codon. Every possible three bases of DNA—each codon— 
represents some amino acid (apart from the three codons that represent a stop signal). 
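

You can check the 4 x 4 x 4 arithmetic with a few lines of Perl. This sketch simply
builds every possible three-base string; the variable names are my own.

```perl
use strict;
use warnings;

my @bases = qw(A C G T);
my @codons;

# Three nested loops: one for each position in the codon
foreach my $first (@bases) {
    foreach my $second (@bases) {
        foreach my $third (@bases) {
            push(@codons, "$first$second$third");
        }
    }
}

print scalar(@codons), " possible codons\n";    # 64 possible codons
print "first: $codons[0], last: $codons[-1]\n"; # first: AAA, last: TTT
```

Sixty-four codons for 20 amino acids and a stop signal leaves plenty of room for the
redundancy described above.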


The chart in Figure 8-1 shows how the various bases combine to form amino acids. 
There are many interesting things to note about the genetic code. For our purposes, the 
most important is redundancy—the way more than one codon translates to the same 
amino acid. We'll program this using character classes and regular expressions, as you'll 
soon see.[2]


[2] Also note that the genetic code in Figure 8-1 is properly based on RNA, where uracil
appears instead of thymine. In our programs, we're going to go directly from DNA to amino 
acids, so our codons will use thymine instead of uracil. 


Figure 8-1. The genetic code

First                          Second Position                            Third
Position        U               C               A               G         Position

   U         UUU Phe         UCU Ser         UAU Tyr         UGU Cys         U
             UUC Phe         UCC Ser         UAC Tyr         UGC Cys         C
             UUA Leu         UCA Ser         UAA Stop        UGA Stop        A
             UUG Leu         UCG Ser         UAG Stop        UGG Trp         G

   C         CUU Leu         CCU Pro         CAU His         CGU Arg         U
             CUC Leu         CCC Pro         CAC His         CGC Arg         C
             CUA Leu         CCA Pro         CAA Gln         CGA Arg         A
             CUG Leu         CCG Pro         CAG Gln         CGG Arg         G

   A         AUU Ile         ACU Thr         AAU Asn         AGU Ser         U
             AUC Ile         ACC Thr         AAC Asn         AGC Ser         C
             AUA Ile         ACA Thr         AAA Lys         AGA Arg         A
             AUG Met (start) ACG Thr         AAG Lys         AGG Arg         G

   G         GUU Val         GCU Ala         GAU Asp         GGU Gly         U
             GUC Val         GCC Ala         GAC Asp         GGC Gly         C
             GUA Val         GCA Ala         GAA Glu         GGA Gly         A
             GUG Val         GCG Ala         GAG Glu         GGG Gly         G


The machinery of the cell actually starts at some point along the RNA and "reads" the
sequence codon after codon, attaching the encoded amino acid to the end of the growing
protein sequence. Example 8-1 simulates this, reading the string of DNA three bases at
a time and concatenating the symbol for the encoded amino acid to the end of the
growing protein string. In the cell, the process stops when a stop codon is encountered.
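

Reading a string three bases at a time is simple with substr. Here is a small sketch of
just that step, without any translation; the subroutine name split_into_codons and the
DNA string are invented for the illustration, and this isn't necessarily how
Example 8-1 goes about it.

```perl
use strict;
use warnings;

# split_into_codons (hypothetical helper)
#
# Return the list of complete three-base codons in a DNA string,
# starting at the first base. Leftover bases at the end are ignored.
sub split_into_codons {
    my ($dna) = @_;
    my @codons;

    for (my $i = 0 ; $i < length($dna) - 2 ; $i += 3) {
        push(@codons, substr($dna, $i, 3));
    }
    return @codons;
}

my @codons = split_into_codons('ATGGCCTAAGG');    # invented sequence

print "@codons\n";    # ATG GCC TAA
```

The loop condition stops two bases short of the end of the string, so the final "GG",
which isn't a whole codon, is left out.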


8.3.2 Translating Codons to Amino Acids


The first task is to enable the following programs to do the translation from the
three-nucleotide codons to the amino acids. This is the most important step in implementing
the genetic code, which is the encoding of amino acids by three-nucleotide codons.


Here's a subroutine that returns an amino acid (represented by a one-letter abbreviation) 
given a three-letter DNA codon: 


# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid

sub codon2aa {
    my($codon) = @_;

       if ( $codon =~ /TCA/i )    { return 'S' }    # Serine
    elsif ( $codon =~ /TCC/i )    { return 'S' }    # Serine
    elsif ( $codon =~ /TCG/i )    { return 'S' }    # Serine
    elsif ( $codon =~ /TCT/i )    { return 'S' }    # Serine
    elsif ( $codon =~ /TTC/i )    { return 'F' }    # Phenylalanine
    elsif ( $codon =~ /TTT/i )    { return 'F' }    # Phenylalanine
    elsif ( $codon =~ /TTA/i )    { return 'L' }    # Leucine
    elsif ( $codon =~ /TTG/i )    { return 'L' }    # Leucine
    elsif ( $codon =~ /TAC/i )    { return 'Y' }    # Tyrosine
    elsif ( $codon =~ /TAT/i )    { return 'Y' }    # Tyrosine
    elsif ( $codon =~ /TAA/i )    { return '_' }    # Stop
    elsif ( $codon =~ /TAG/i )    { return '_' }    # Stop
    elsif ( $codon =~ /TGC/i )    { return 'C' }    # Cysteine
    elsif ( $codon =~ /TGT/i )    { return 'C' }    # Cysteine
    elsif ( $codon =~ /TGA/i )    { return '_' }    # Stop
    elsif ( $codon =~ /TGG/i )    { return 'W' }    # Tryptophan
    elsif ( $codon =~ /CTA/i )    { return 'L' }    # Leucine
    elsif ( $codon =~ /CTC/i )    { return 'L' }    # Leucine
    elsif ( $codon =~ /CTG/i )    { return 'L' }    # Leucine
    elsif ( $codon =~ /CTT/i )    { return 'L' }    # Leucine
    elsif ( $codon =~ /CCA/i )    { return 'P' }    # Proline
    elsif ( $codon =~ /CCC/i )    { return 'P' }    # Proline
    elsif ( $codon =~ /CCG/i )    { return 'P' }    # Proline
    elsif ( $codon =~ /CCT/i )    { return 'P' }    # Proline
    elsif ( $codon =~ /CAC/i )    { return 'H' }    # Histidine
    elsif ( $codon =~ /CAT/i )    { return 'H' }    # Histidine
    elsif ( $codon =~ /CAA/i )    { return 'Q' }    # Glutamine
    elsif ( $codon =~ /CAG/i )    { return 'Q' }    # Glutamine
    elsif ( $codon =~ /CGA/i )    { return 'R' }    # Arginine
    elsif ( $codon =~ /CGC/i )    { return 'R' }    # Arginine
    elsif ( $codon =~ /CGG/i )    { return 'R' }    # Arginine
    elsif ( $codon =~ /CGT/i )    { return 'R' }    # Arginine
    elsif ( $codon =~ /ATA/i )    { return 'I' }    # Isoleucine
    elsif ( $codon =~ /ATC/i )    { return 'I' }    # Isoleucine
    elsif ( $codon =~ /ATT/i )    { return 'I' }    # Isoleucine
    elsif ( $codon =~ /ATG/i )    { return 'M' }    # Methionine
    elsif ( $codon =~ /ACA/i )    { return 'T' }    # Threonine
    elsif ( $codon =~ /ACC/i )    { return 'T' }    # Threonine
    elsif ( $codon =~ /ACG/i )    { return 'T' }    # Threonine
    elsif ( $codon =~ /ACT/i )    { return 'T' }    # Threonine
    elsif ( $codon =~ /AAC/i )    { return 'N' }    # Asparagine
    elsif ( $codon =~ /AAT/i )    { return 'N' }    # Asparagine
    elsif ( $codon =~ /AAA/i )    { return 'K' }    # Lysine
    elsif ( $codon =~ /AAG/i )    { return 'K' }    # Lysine
    elsif ( $codon =~ /AGC/i )    { return 'S' }    # Serine
    elsif ( $codon =~ /AGT/i )    { return 'S' }    # Serine
    elsif ( $codon =~ /AGA/i )    { return 'R' }    # Arginine
    elsif ( $codon =~ /AGG/i )    { return 'R' }    # Arginine
    elsif ( $codon =~ /GTA/i )    { return 'V' }    # Valine
    elsif ( $codon =~ /GTC/i )    { return 'V' }    # Valine
    elsif ( $codon =~ /GTG/i )    { return 'V' }    # Valine
    elsif ( $codon =~ /GTT/i )    { return 'V' }    # Valine
    elsif ( $codon =~ /GCA/i )    { return 'A' }    # Alanine
    elsif ( $codon =~ /GCC/i )    { return 'A' }    # Alanine
    elsif ( $codon =~ /GCG/i )    { return 'A' }    # Alanine
    elsif ( $codon =~ /GCT/i )    { return 'A' }    # Alanine
    elsif ( $codon =~ /GAC/i )    { return 'D' }    # Aspartic Acid
    elsif ( $codon =~ /GAT/i )    { return 'D' }    # Aspartic Acid
    elsif ( $codon =~ /GAA/i )    { return 'E' }    # Glutamic Acid
    elsif ( $codon =~ /GAG/i )    { return 'E' }    # Glutamic Acid
    elsif ( $codon =~ /GGA/i )    { return 'G' }    # Glycine
    elsif ( $codon =~ /GGC/i )    { return 'G' }    # Glycine
    elsif ( $codon =~ /GGG/i )    { return 'G' }    # Glycine
    elsif ( $codon =~ /GGT/i )    { return 'G' }    # Glycine
    else {
        print STDERR "Bad codon \"$codon\"!!\n";
        exit;
    }
}


This code is clear and simple, and the layout makes it obvious what's happening. 
However, it can take a while to run. For instance, given the codon GGT for glycine, it has 
to check each test until it finally succeeds on the last one, and that's a lot of string 
comparisons. Still, the code achieves its purpose. 


There's something new happening in the code's error message. Recall filehandles from 
Chapter 4 and how they access data in files. From Chapter 5, remember the special 
filehandle STDIN that reads user input from the keyboard. STDOUT and STDERR are 
also special filehandles that are always available to Perl programs. STDOUT directs 
output to the screen (usually) or another standard place. When a filehandle is missing 
from a print statement, STDOUT is assumed. The print statement accepts a filehandle 
as an optional argument, but so far, we've been printing to the default STDOUT. Here, 
error messages are directed to STDERR, which usually prints to the screen, but on many 
computer systems they can be directed to a special error file or other location. 
Alternatively, you sometimes want to direct STDOUT to a file or elsewhere but want 
STDERR error messages to appear on your screen. I mention these options because you 
are likely to come across them in Perl code; we don't use them much in this book (see 
Appendix B for more information). 


8.3.3 The Redundancy of the Genetic Code 


I've remarked on the redundancy of the genetic code, and the last subroutine clearly 
displays this redundancy. It might be interesting to express that in your subroutine. 
Notice that groups of redundant codons almost always have the same first and second 
bases and vary in the third. You've used character classes in regular expressions to match 
any of a set of characters. Now, let's try to redo the subroutine to make one test for each 
redundant group of codons: 


# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid
#   Version 2

sub codon2aa {
    my($codon) = @_;

       if ( $codon =~ /GC./i)        { return 'A' }    # Alanine
    elsif ( $codon =~ /TG[TC]/i)     { return 'C' }    # Cysteine
    elsif ( $codon =~ /GA[TC]/i)     { return 'D' }    # Aspartic Acid
    elsif ( $codon =~ /GA[AG]/i)     { return 'E' }    # Glutamic Acid
    elsif ( $codon =~ /TT[TC]/i)     { return 'F' }    # Phenylalanine
    elsif ( $codon =~ /GG./i)        { return 'G' }    # Glycine
    elsif ( $codon =~ /CA[TC]/i)     { return 'H' }    # Histidine
    elsif ( $codon =~ /AT[TCA]/i)    { return 'I' }    # Isoleucine
    elsif ( $codon =~ /AA[AG]/i)     { return 'K' }    # Lysine
    elsif ( $codon =~ /TT[AG]|CT./i) { return 'L' }    # Leucine
    elsif ( $codon =~ /ATG/i)        { return 'M' }    # Methionine
    elsif ( $codon =~ /AA[TC]/i)     { return 'N' }    # Asparagine
    elsif ( $codon =~ /CC./i)        { return 'P' }    # Proline
    elsif ( $codon =~ /CA[AG]/i)     { return 'Q' }    # Glutamine
    elsif ( $codon =~ /CG.|AG[AG]/i) { return 'R' }    # Arginine
    elsif ( $codon =~ /TC.|AG[TC]/i) { return 'S' }    # Serine
    elsif ( $codon =~ /AC./i)        { return 'T' }    # Threonine
    elsif ( $codon =~ /GT./i)        { return 'V' }    # Valine
    elsif ( $codon =~ /TGG/i)        { return 'W' }    # Tryptophan
    elsif ( $codon =~ /TA[TC]/i)     { return 'Y' }    # Tyrosine
    elsif ( $codon =~ /TA[AG]|TGA/i) { return '_' }    # Stop
    else {
        print STDERR "Bad codon \"$codon\"!!\n";
        exit;
    }
}


Using character classes and regular expressions, this code clearly shows the redundancy
of the genetic code. Also notice that the one-character codes for the amino acids are now
in alphabetical order.


A character class such as [TC] matches a single character, either T or C. The . is the 
regular expression that matches any character except a newline. The /GT./i expression 
for valine matches GTA, GTC, GTG, and GTT, all of which are codons for valine. (Of 
course, the period matches any other character, but the $codon is assumed to have only 
A, C, G, or T characters.) The i after the regular expression means match uppercase or 
lowercase; for instance, /T/i matches T or t. 


The new feature in these regular expressions is the use of the vertical bar or pipe (|) to 
separate two choices. Thus for serine, /TC.|AG[TC]/ matches /TC./ or /AG[TC]/. 
In this program, you need only two choices per regular expression, but you can use as 
many vertical bars as you like. 


You can also group parts of a regular expression in parentheses, and 
use vertical bars in them. For example, /give me a (break|meal)/ 
matches "give me a break" or "give me a meal." 
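Here is a minimal sketch of alternation, with and without grouping parentheses. The test 
strings and the variable names are made up for illustration; only the two patterns come 
from the text above. 

```perl
use strict;
use warnings;

# Alternation without parentheses: the bar splits the whole pattern in two,
# as in the serine test /TC.|AG[TC]/i.
my @serine_hits = grep { /TC.|AG[TC]/i } qw(TCA tcg AGT AGC AGA GGG);

# Alternation inside parentheses applies only to the grouped part.
my @phrase_hits = grep { /give me a (break|meal)/ }
    ('give me a break', 'give me a meal', 'give me a call');

print "serine codons: @serine_hits\n";
print "phrases: ", join('; ', @phrase_hits), "\n";
```

AGA and GGG are rejected by the serine pattern, and "give me a call" is rejected by the 
grouped pattern, because neither alternative matches them. 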


8.3.4 Using Hashes for the Genetic Code 


If you think about using a hash for this translation, you'll see it's a natural way to proceed. 
For each codon key the amino acid value is returned. Here's the code: 


#
# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid
#   Version 3, using hash lookup

sub codon2aa {
    my($codon) = @_;

    $codon = uc $codon;

    my(%genetic_code) = (

    'TCA' => 'S',    # Serine
    'TCC' => 'S',    # Serine
    'TCG' => 'S',    # Serine
    'TCT' => 'S',    # Serine
    'TTC' => 'F',    # Phenylalanine
    'TTT' => 'F',    # Phenylalanine
    'TTA' => 'L',    # Leucine
    'TTG' => 'L',    # Leucine
    'TAC' => 'Y',    # Tyrosine
    'TAT' => 'Y',    # Tyrosine
    'TAA' => '_',    # Stop
    'TAG' => '_',    # Stop
    'TGC' => 'C',    # Cysteine
    'TGT' => 'C',    # Cysteine
    'TGA' => '_',    # Stop
    'TGG' => 'W',    # Tryptophan
    'CTA' => 'L',    # Leucine
    'CTC' => 'L',    # Leucine
    'CTG' => 'L',    # Leucine
    'CTT' => 'L',    # Leucine
    'CCA' => 'P',    # Proline
    'CCC' => 'P',    # Proline
    'CCG' => 'P',    # Proline
    'CCT' => 'P',    # Proline
    'CAC' => 'H',    # Histidine
    'CAT' => 'H',    # Histidine
    'CAA' => 'Q',    # Glutamine
    'CAG' => 'Q',    # Glutamine
    'CGA' => 'R',    # Arginine
    'CGC' => 'R',    # Arginine
    'CGG' => 'R',    # Arginine
    'CGT' => 'R',    # Arginine
    'ATA' => 'I',    # Isoleucine
    'ATC' => 'I',    # Isoleucine
    'ATT' => 'I',    # Isoleucine
    'ATG' => 'M',    # Methionine
    'ACA' => 'T',    # Threonine
    'ACC' => 'T',    # Threonine
    'ACG' => 'T',    # Threonine
    'ACT' => 'T',    # Threonine
    'AAC' => 'N',    # Asparagine
    'AAT' => 'N',    # Asparagine
    'AAA' => 'K',    # Lysine
    'AAG' => 'K',    # Lysine
    'AGC' => 'S',    # Serine
    'AGT' => 'S',    # Serine
    'AGA' => 'R',    # Arginine
    'AGG' => 'R',    # Arginine
    'GTA' => 'V',    # Valine
    'GTC' => 'V',    # Valine
    'GTG' => 'V',    # Valine
    'GTT' => 'V',    # Valine
    'GCA' => 'A',    # Alanine
    'GCC' => 'A',    # Alanine
    'GCG' => 'A',    # Alanine
    'GCT' => 'A',    # Alanine
    'GAC' => 'D',    # Aspartic Acid
    'GAT' => 'D',    # Aspartic Acid
    'GAA' => 'E',    # Glutamic Acid
    'GAG' => 'E',    # Glutamic Acid
    'GGA' => 'G',    # Glycine
    'GGC' => 'G',    # Glycine
    'GGG' => 'G',    # Glycine
    'GGT' => 'G',    # Glycine
    );

    if(exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    }else{
        print STDERR "Bad codon \"$codon\"!!\n";
        exit;
    }
}


This subroutine is simple: it initializes a hash and then performs a single lookup of its 
single argument in the hash. The hash has 64 keys, one for each codon. 


Notice there's a function exists that returns true if the key $codon exists in the hash. 
The test does the same job as the final else clause in the two previous versions of the 
codon2aa subroutine: it catches codons that have no translation. 


1 A key might exist in a hash, but its value can be undefined. The defined function checks for 
defined values. Also, of course, the value might be 0 or the empty string, in which case, it fails 
a test such as if ($hash{$key}) because, even though the key exists and the value is defined, 
the value evaluates to false in a conditional test. 
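The three tests in this note can be compared side by side. The following is only a sketch: 
the hash %hash, its keys, and the helper key_status are invented for the illustration. 

```perl
use strict;
use warnings;

# A made-up hash with one undefined value and one false-but-defined value.
my %hash = ( present => 'A', undefined => undef, zero => 0 );

# Return a three-character summary: exists, defined, true (1 or 0 each).
sub key_status {
    my($key) = @_;
    return join '',
        ( exists  $hash{$key} ? 1 : 0 ),
        ( defined $hash{$key} ? 1 : 0 ),
        (         $hash{$key} ? 1 : 0 );
}

foreach my $key (qw(present undefined zero missing)) {
    print "$key: ", key_status($key), "\n";
}
```

Only exists separates a key that is missing from one whose value happens to be undefined 
or false, which is why it is the right test for a lookup table. 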


Also notice that to make this subroutine work on lowercase DNA as well as uppercase, 
you translate the incoming argument into uppercase to match the data in the 
%genetic_code hash. You can't give a regular expression to a hash as a key; it must 
be a simple scalar value, such as a string or a number, so the case translation must be 
done first. (Alternatively, you can make the hash twice as big.) Similarly, character 
classes don't work in the keys for hashes, so you have to specify each one of the 64 
codons individually. 


You may wonder why bother wrapping this last bit of code in a subroutine at all. Why not 
just declare and initialize the hash and do the lookups directly in the hash instead of going 
through the subroutine? Well, the subroutine does do a little bit of error checking for 
nonexistent keys, so having a subroutine saves doing that error checking yourself each 
time you use the hash. 


Additionally, wrapping the code in a subroutine gives a little insurance for the future. If 
all the code you write does codon translation by means of our subroutine, it would be 
simplicity itself to switch over to a new way of doing the translation. Perhaps a new kind 
of datatype will be added to Perl in the future, or perhaps you want to do lookups from a 
database or a DBM file. Then all you have to do is change the internals of this one 
subroutine. As long as the interface to the subroutine remains the same (that is to say, as 
long as it still takes one codon as an argument and returns a one-character amino acid), 
you don't need to worry about how it accomplishes the translation from the standpoint of 
the rest of the programs. Our subroutine has become a black box. This is one 
significant benefit of modularization and organization of programs with subroutines. 


There's another good, and biological, reason why you should use a subroutine for the 
genetic code. There is actually more than one genetic code, because there are differences 
as to how DNA encodes amino acids among mammals, plants, insects, and yeast— 
especially in the mitochondria. So if you have modularized the genetic code, you can 
easily modify your program to work with a range of organisms. 


One of the benefits of hashes is that they are fast. Unfortunately, our subroutine declares 
the whole hash each time the subroutine is called, even for one lookup. This isn't so 
efficient; in fact, it's kind of slow. There are other, much faster ways that involve 
declaring the genetic code hash only once as a global variable, but they would take us a 
little far afield at this point. Our current version has the advantage of being easy to read. 
So, let's be officially happy with the hash version of codon2aa and put it into our 
module in the file BeginPerlBioinfo.pm (see Chapter 6). 
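For the curious, here is a rough sketch of one such faster arrangement, in which the hash 
is declared once at file scope rather than inside the subroutine. The names %partial_code 
and codon2aa_shared are hypothetical, only four codons are listed to keep it short, and it 
uses die for bad codons instead of printing and exiting. 

```perl
use strict;
use warnings;

# Declared once, at file scope, so it is built a single time when the
# file is loaded. Only four codons are listed to keep the sketch short.
my %partial_code = (
    'ATG' => 'M',    # Methionine
    'TGG' => 'W',    # Tryptophan
    'TAA' => '_',    # Stop
    'GGA' => 'G',    # Glycine
);

# Hypothetical variant of codon2aa that only performs the lookup.
sub codon2aa_shared {
    my($codon) = @_;
    $codon = uc $codon;
    if (exists $partial_code{$codon}) {
        return $partial_code{$codon};
    }
    die "Bad codon \"$codon\"!!\n";
}

print codon2aa_shared('atg'), codon2aa_shared('TGG'), "\n";
```

The interface is unchanged (one codon in, one amino acid out), so callers would not need 
to know which version they are using. 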


Now that we've got a satisfactory way to translate codons to amino acids, we'll start to 
use it in the next section and in the examples. 


8.4 Translating DNA into Proteins 


Example 8-1 shows how the new codon2aa subroutine translates a whole DNA 
sequence into protein. 


Example 8-1. Translate DNA into protein 


#!/usr/bin/perl
# Translate DNA into protein

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Initialize variables
my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
my $protein = '';
my $codon;

# Translate each three-base codon into an amino acid, and append to a protein
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
    $codon = substr($dna,$i,3);
    $protein .= codon2aa($codon);
}

print "I translated the DNA\n\n$dna\n\n  into the protein\n\n$protein\n\n";

exit;


To make this work, you'll need the BeginPerlBioinfo.pm module for your subroutines in a 
separate file the program can find, as discussed in Chapter 6. You also have to add the 
codon2aa subroutine to it. Alternatively, you can add the code for the subroutine 
codon2aa directly to the program in Example 8-1 and remove the reference to the 
BeginPerlBioinfo.pm module. 


Here's the output from Example 8-1: 


I translated the DNA

CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC

  into the protein

RRLRTGLARVGR

You've seen all the elements in Example 8-1 before, except for the way it loops 
through the DNA with this statement: 

for(my $i=0; $i < (length($dna) - 2) ; $i += 3) { 

Recall that a for loop has three parts, delimited by the two semicolons. The first part 
initializes a counter: my $i=0 statically scopes the $i variable so it's visible only inside 
this block, and any other $i elsewhere in the code (well, in this case, there aren't any, but 
it can happen) is now invisible inside the block. The third part of the for loop 
increments the counter after all the statements in the block are executed and before 
returning to the beginning of the loop: 

$i += 3 


Since you're trying to march through the DNA three bases at a shot, you increment by 
three. 


The second, middle part of the for loop tests whether the loop should continue: 

$i < (length($dna) - 2) 

The point is that if there are zero, one, or two bases left, you should quit, because there 
aren't enough to make a codon. Now, the positions in a string of DNA of a certain length 
are numbered from 0 to length-1. So if the position counter $i has reached 
length-2, there are only two more bases (at positions length-2 and length-1), and 
you should quit. Only if the position counter $i is less than length-2 will you still 
have at least three bases left, enough for a codon. So the test succeeds only if: 

$i < (length($dna) - 2) 

(Notice also how the whole expression to the right of the less-than sign is enclosed in 
parentheses; we'll discuss this in Chapter 9 in Section 9.3.1.) 
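To see the loop test at work, the following sketch runs the same loop over DNA strings 
of length 6, 7, 8, and 9 and collects the codons it visits. The helper name codons_in and 
the test strings are made up for the illustration. 

```perl
use strict;
use warnings;

# Return the codons the loop visits for a given DNA string.
sub codons_in {
    my($dna) = @_;
    my @codons = ();
    for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
        push @codons, substr($dna, $i, 3);
    }
    return @codons;
}

foreach my $dna ('CGACGT', 'CGACGTC', 'CGACGTCT', 'CGACGTCTT') {
    my @codons = codons_in($dna);
    print length($dna), " bases: @codons\n";
}
```

The strings of length 7 and 8 yield the same two codons as the string of length 6: the one 
or two leftover bases are skipped, and a third codon appears only at length 9. 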


The line of code: 

$codon = substr($dna, $i, 3); 

actually extracts the 3-base codon from the DNA. The call to the substr function 
specifies a substring of $dna at position $i of length 3, and saves it in the variable 
$codon. 


If you know you'll need to do this DNA-to-protein translation a lot, you can turn 
Example 8-1 into a subroutine. Whenever you write a subroutine, you have to think 
about which arguments you may want to give the subroutine. So you realize there may 
come a time when you'll have some large DNA sequence but only want to translate a 
given part of it. Should you add two arguments to the subroutine as beginning and end 
points? You could, but decide not to. It's a judgment call, part of the art of decomposing 
a collection of code into useful fragments. But it might be better to have a subroutine that 
just translates; then you can make it part of a larger subroutine that picks endpoints in the 
sequence, if needed. The thinking is that you'll usually just translate the whole thing, and 
always typing in 0 for the start and length($dna)-1 for the end would be an 
annoyance. Of course, this depends on what you're doing, so this particular choice just 
illustrates your thinking when you write the code. 


You should also remove the informative print statement at the end, because 
it's more suited to a main program than a subroutine. 


Anyway, you've now thought through the design and just want a subroutine that takes one 
argument containing DNA and returns a peptide translation: 


# dna2peptide
#
# A subroutine to translate DNA sequence into a peptide

sub dna2peptide {

    my($dna) = @_;

    use strict;
    use warnings;
    use BeginPerlBioinfo;     # see Chapter 6 about this module

    # Initialize variables
    my $protein = '';

    # Translate each three-base codon to an amino acid, and append to a protein
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $protein .= codon2aa( substr($dna,$i,3) );
    }

    return $protein;
}

Now add subroutine dna2peptide to the BeginPerlBioinfo.pm module. 


Notice that you've eliminated one of the variables in making the subroutine out of 
Example 8-1: the variable $codon. Why? 


Well, one reason is because you can. In Example 8-1, you were using 
substr to extract the codon from $dna, saving it in variable $codon and then passing 
it into the subroutine codon2aa. This new way eliminates the middleman. Put the call 
to substr that extracts the codon as the argument to the subroutine codon2aa so that 
the value is passed in just as before, but without having to copy it to the variable $codon 
first. 


This has somewhat improved efficiency and speed. Since copying strings is one of the 
slower things computer programs do, eliminating a bunch of string copies is an easy and 
effective way to speed up a program. 


But has it made the program less readable? You be the judge. I think it has, a little, but 
the comment right before the loop seems to make everything clear enough, for me, 
anyway. It's important to have readable code, so if you really need to boost the speed of a 
subroutine, but find it makes the code harder to read, be sure to include enough 
comments for the reader to be able to understand what's going on. 


For the first time, use declarations are being included in a subroutine instead of the 
main program: 

use strict; 

use warnings; 

use BeginPerlBioinfo; 





This may be redundant with the calls in the main program, but it doesn't do any harm 
(Perl checks and loads a module only once). If this subroutine should be called from a 
module that doesn't already load the modules, it's done some good after all. 


Now let's improve how we deal with DNA in files. 


8.5 Reading DNA from Files in FASTA Format 


Over the fairly short history of bioinformatics, several different biologists and 
programmers invented several ways to format sequence data in computer files, and so 
bioinformaticians must deal with these different formats. We need to extract the sequence 
data and the annotations from these files, which requires writing code to deal with each 
different format. 


There are many such formats, perhaps as many as 20 in regular use for DNA alone. The 
very multiplicity of these formats can be an annoyance when you're analyzing a sequence 
in the lab: it becomes necessary to translate from one format to another for the various 
programs you use to examine the sequence. Here are some of the most popular: 


FASTA 


The FASTA and Basic Local Alignment Search Tool (BLAST) programs 
are popular; they both use the FASTA format. Because of its simplicity, the 
FASTA format is perhaps the most widely used of all formats, aside from 
GenBank. 


Genetic Sequence Data Bank (GenBank) 


GenBank is a collection of all publicly released genetic data. It includes lots of 
information in addition to the DNA sequence. It's very important, and we'll be 
looking closely at GenBank files in Chapter 10. 


European Molecular Biology Laboratory (EMBL) 


The EMBL database has substantially the same data as the GenBank and the 
DDBJ (DNA Data Bank of Japan), but the format is somewhat different. 


Simple data, or Applied Biosystems (ABI) sequencer output 


This is DNA sequence data that has no formatting whatsoever, just the characters 
that represent the bases; it is output into files by the sequencing machines from 
ABI and from other machines and programs. 


Protein Identification Resource (PIR) 


PIR is a well-curated collection of protein sequence data. 


Genetics Computer Group (GCG) 


The GCG program (a.k.a. the GCG Wisconsin package) from Accelrys is used at 
many large research institutions. Data must be in GCG format to be usable by 
their programs. 


Of these six sequence formats, GenBank and FASTA are by far the most common. The 
next few sections take you through the process of reading and manipulating data in 
FASTA. 


8.5.1 FASTA Format 


Let's write a subroutine that can handle FASTA-style data. This is useful in its own right 
and as a warm-up for the upcoming chapters on GenBank, PDB, and BLAST. 


FASTA format is basically just lines of sequence data with newlines at the end so it can 
be printed on a page or displayed on a computer screen. The length of the lines isn't 
specified, but for compatibility, it's best to limit them to 80 characters in length. There is 
also header information, a line or lines at the beginning of the file that start with the 
greater-than > character, that can contain any text whatsoever (or no text). Typically, a 
header line contains the name of the DNA or the gene it comes from, often separated by a 
vertical bar from additional information about the sequence, the experiment that produced 
it, or other, nonsequence information of that nature. 


Much FASTA-aware software insists that there must be only one header line; others 
permit several lines. Our subroutine will accept either one or several header lines plus 
comments beginning with #. 


The following is a FASTA file. We'll call it sample.dna and use it in several programs. 
You should copy it, download it from this book's web site, or make up your own file with 
your own data. 

> sample dna | (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga
gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag
ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca
gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa
atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc
agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca
tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa
gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac
ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc
acacctgagccactctcagatgaggaccta











8.5.2 A Design to Read FASTA Files 


In Chapter 4, you learned how to read in sequence data; here, you just have to extend 
that method to deal with the header lines. You'll also learn how to discard empty lines 
and lines that begin with the pound sign #, i.e., comments in Perl and other languages and 
file formats. (These don't appear in the FASTA file sample.dna just shown.) 


There are two choices when reading in the data. You can read from the open file one line 
at a time, making decisions as you go. Or, you can slurp the whole file into an array and 
then operate on the array. For very big files, it's sometimes best to read them one line at a 
time, especially if you're looking for some small bit of information. (This is because 
reading a large file into an array uses a large amount of memory. If your system isn't 
robust enough, it may crash.) 


For smaller, normal-sized files, the advantage to reading all the data into an array is that 
you can then easily look through the data and do operations on it. That's what we'll do 
with our subroutine, but remember, this approach can cause memory space problems with 
larger files, and there are other ways of proceeding. 
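For comparison, here is a small sketch of the other choice, reading one line at a time. It 
is not the approach our subroutine takes. To stay self-contained it writes a tiny made-up 
FASTA file with the standard File::Temp module first, and it uses a lexical filehandle 
with the three-argument form of open, which is an assumption of this sketch rather than 
the style used elsewhere in this chapter. 

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny made-up FASTA file so the sketch is self-contained.
my($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print $tmp_fh "> tiny example\nacgt\nttga\n";
close $tmp_fh;

# Read one line at a time: only the current line is held in memory.
my $base_count = 0;
open(my $in, '<', $tmp_name) or die "Cannot open file \"$tmp_name\": $!\n";
while (my $line = <$in>) {
    next if $line =~ /^>/;      # skip header lines
    chomp $line;
    $base_count += length $line;
}
close $in;

print "Counted $base_count bases\n";
```

Because nothing is accumulated except a running count, memory use stays small no matter 
how long the file is. 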


Let's write a subroutine that, given as an argument a filename containing FASTA- 
formatted data, returns the sequence data. 


Before doing so you should think about whether you should have just one subroutine, or 
perhaps one subroutine that opens and reads a file, called by another subroutine that 
extracts the sequence data. Let's use two subroutines, keeping in mind that you can reuse 
the subroutine that deals with arbitrary files every time you need to write such a program 
for other formats. 


Let's start with some pseudocode: 
subroutine get data from a file {

    argument = filename

    open file
        if can't open, print error message and exit

    read in data and

    return @data
}

subroutine extract sequence data from fasta file {

    argument = array of file data in fasta format

    Discard all header lines
     (and blank and comment lines for good measure)
     If first character of first line is >, discard it

    Read in the rest of the file, join in a scalar,
    edit out nonsequence data

    return sequence
}

In the first subroutine that gets data from a file, there's a question as to what's the best 
thing to do when the file can't be read. Here, we're taking the drastic approach: yelling 
"Fire!" and exiting. But you wouldn't necessarily want your program to just stop 
whenever it can't open a file. Maybe you're asking for filenames from the user at the 
keyboard or on a web page, and you'd like to give them three chances to type in the 
filename correctly. Or maybe, if the file can't be opened, you want a default file instead. 


Maybe you can return a false value, such as an empty array, if you can't open the file. 
Then a program that calls this subroutine can exit, try again, or whatever it wants. But 
what if you successfully open the file, but it was absolutely empty? Then you'd have 
succeeded and returned an empty array, and the program calling this subroutine would 
think incorrectly, that the file couldn't be opened. So, that wouldn't work. 


There are other options, such as returning the special "undefined" value. Let's keep what 
we've got, but it's important to remember that handling errors can be an important, and 
sometimes tricky, part of writing robust code, code that responds well in unusual 
circumstances. 
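As an illustration of the "undefined" option only (it is not what our subroutine does), 
the following sketch returns a reference to an array of lines on success and undef on 
failure, so an empty file and an unreadable file can be told apart. The name 
try_get_file_data and the file names are hypothetical, and the standard File::Temp 
module is used just to create an empty test file. 

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical variant of the file-reading subroutine: it returns a
# reference to an array of lines on success and undef on failure.
sub try_get_file_data {
    my($filename) = @_;
    open(my $fh, '<', $filename) or return undef;
    my @filedata = <$fh>;
    close $fh;
    return \@filedata;
}

# An empty file that does exist, and a name that should not.
my($tmp_fh, $empty_name) = tempfile(UNLINK => 1);
close $tmp_fh;
my $missing_name = "$empty_name.does-not-exist";

foreach my $name ($empty_name, $missing_name) {
    my $data = try_get_file_data($name);
    if (defined $data) {
        print "opened, ", scalar(@$data), " lines\n";
    } else {
        print "could not open\n";
    }
}
```

The calling program can then decide for itself whether to exit, ask again, or fall back 
to a default file. 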


The second subroutine takes the array of FASTA-formatted sequence and returns just the 
unformatted sequence in a string. 


8.5.3 A Subroutine to Read FASTA Files 


Now that you've thought about the problem, written some pseudocode, considered 
alternate ways of designing the subroutines and the costs and benefits of the choices, 
you're ready to code: 


# get_file_data
#
# A subroutine to get data from a file given its filename

sub get_file_data {

    my($filename) = @_;

    use strict;
    use warnings;

    # Initialize variables
    my @filedata = (  );

    unless( open(GET_FILE_DATA, $filename) ) {
        print STDERR "Cannot open file \"$filename\"\n\n";
        exit;
    }

    @filedata = <GET_FILE_DATA>;

    close GET_FILE_DATA;

    return @filedata;
}

# extract_sequence_from_fasta_data
#
# A subroutine to extract FASTA sequence data from an array

sub extract_sequence_from_fasta_data {

    my(@fasta_file_data) = @_;

    use strict;
    use warnings;

    # Declare and initialize variables
    my $sequence = '';

    foreach my $line (@fasta_file_data) {

        # discard blank line
        if ($line =~ /^\s*$/) {
            next;

        # discard comment line
        } elsif($line =~ /^\s*#/) {
            next;

        # discard fasta header line
        } elsif($line =~ /^>/) {
            next;

        # keep line, add to sequence string
        } else {
            $sequence .= $line;
        }
    }

    # remove non-sequence data (in this case, whitespace) from $sequence string
    $sequence =~ s/\s//g;

    return $sequence;
}

Notice that nowhere in the code for extract_sequence_from_fasta_data do you check to 
see what's in the file: is it really DNA or protein sequence data in FASTA format? Of 
course, you can write a subroutine (call it is_fasta) that checks the data to see if it's 
what we expect. But I'll leave that for the exercises. 


A few comments about the extract_sequence_from_fasta_data subroutine should be 
made. The following line includes a variable declaration as it is used in a loop: 

foreach my $line (@fasta_file_data) { 

You've seen this in for loops as well. It's convenient to declare my variables such as 
$line on the spot, as they tend to have common names and aren't used outside the loop. 


Some of the regular expressions deserve brief comment. In this line: 


if ($line =~ /^\s*$/) { 

the \s matches whitespace, that is, space, tab, formfeed, carriage return, or newline. \s* 
matches any amount of whitespace (even none). The ^ matches the beginning of the line, 
and the $ matches the end of the line. So altogether, this regular expression matches 
blank lines with nothing or only whitespace in them. 


This regular expression also has nothing or only whitespace at the beginning of the line, 
up to a pound sign: 

} elsif($line =~ /^\s*#/) { 

This expression matches a greater-than sign at the beginning of the line: 

} elsif($line =~ /^>/) { 

Finally, the following statement removes whitespace, including newlines: 

$sequence =~ s/\s//g; 
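The regular expressions just described can be tried out on a few sample lines. This is 
only a sketch: the helper classify_line and the sample lines are invented here, and the 
patterns are the ones discussed above. 

```perl
use strict;
use warnings;

# Classify one line the same way the if/elsif chain does.
sub classify_line {
    my($line) = @_;
    if    ($line =~ /^\s*$/) { return 'blank'    }
    elsif ($line =~ /^\s*#/) { return 'comment'  }
    elsif ($line =~ /^>/)    { return 'header'   }
    else                     { return 'sequence' }
}

my @lines = ("> sample dna\n", "\n", "   \t\n", "  # a comment\n", "acgt\n");
foreach my $line (@lines) {
    (my $shown = $line) =~ s/\s+$//;    # trim for display only
    print classify_line($line), ": '$shown'\n";
}

# Joining sequence lines and then removing whitespace drops the newlines.
my $sequence = "acgt\nttga\n";
$sequence =~ s/\s//g;
print "$sequence\n";
```

Note that a line of spaces and tabs counts as blank, and a comment may be indented, 
because both patterns allow leading whitespace. 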

We've placed these two new subroutines into our BeginPerlBioinfo.pm module. Now 
let's write a main program for these subroutines and look at the output. First, there's one 
more subroutine to write that handles the printing of long sequences. 


8.5.4 Writing Formatted Sequence Data 


When you try to print the "raw" sequence data, it can be a problem if the data is much 
longer than the width of the page. For most practical purposes, 80 characters is about the 
maximum length you should try to fit across a page. Let's write a print_sequence 
subroutine that takes as its arguments some sequence and a line length and prints out the 
sequence, breaking it up into lines of that length. It will have a strong similarity to the 
dna2peptide subroutine. Here it is: 

# print_sequence
#
# A subroutine to format and print sequence data

sub print_sequence {

    my($sequence, $length) = @_;

    use strict;
    use warnings;

    # Print sequence in lines of $length
    for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {
        print substr($sequence, $pos, $length), "\n";
    }
}

The code depends on the behavior of substr, which gives the partial substring at the end 
of the string, even if it's less than the requested length. You can see there's a new 
print_sequence subroutine in the BeginPerlBioinfo.pm module (see Chapter 6). 
We remembered to keep the statement 1; as the last line of the module. Example 8-2 
shows the main program. 
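The behavior of substr at the end of a string is easy to check with a short sequence. In 
this sketch the helper split_sequence is a made-up variant of print_sequence that collects 
the pieces in an array instead of printing them directly. 

```perl
use strict;
use warnings;

# Collect the pieces instead of printing them, so they can be inspected.
sub split_sequence {
    my($sequence, $length) = @_;
    my @pieces = ();
    for (my $pos = 0; $pos < length($sequence); $pos += $length) {
        push @pieces, substr($sequence, $pos, $length);
    }
    return @pieces;
}

my @pieces = split_sequence('acgtacgtac', 4);
print "$_\n" foreach @pieces;
```

A ten-base sequence split into lines of four gives two full lines and a final line holding 
the two bases that remain. 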


Example 8-2. Read a FASTA file and extract the sequence data 


#!/usr/bin/perl
# Read a fasta file and extract the sequence data

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Declare and initialize variables
my @file_data = (  );
my $dna = '';

# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");

# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);

# Print the sequence in lines 25 characters long
print_sequence($dna, 25);

exit;

Here's the output of Example 8-2: 
agatggcggcgctgaggggtcttgg
gggctctaggccggccacctactgg
tttgcagcggagacgacgcatgggg
cctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagt
agttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgc
aggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcgga
gatatggagggagatggttcagacc
cagagcctccagatgccggggagga
cagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgca
aaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtgg
ttccatggggactgcatccggatca
ctgagaagatggccaaggccatccg
ggagtggtactgtcgggagtgcaga
gagaaagaccccaagctagagattc
gctatcggcacaagaagtcacggga
gcgggatggcaatgagcgggacagc
agtgagccccgggatgagggtggag
ggcgcaagaggcctgtccctgatcc
agacctgcagcgccgggcagggtca
gggacaggggttggggccatgcttg
ctcggggctctgcttcgccccacaa
atcctctccgcagcccttggtggcc
acacccagccagcatcaccagcagc
agcagcagcagatcaaacggtcagc
ccgcatgtgtggtgagtgtgaggca
tgtcggcgcactgaggactgtggtc
actgtgatttctgtcgggacatgaa
gaagttcgggggccccaacaagatc
cggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgta
caagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgc
caaggccccgccggccactgcccac
ccaacagcagccacagccatcacag
aagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagt
caaggagcctcctgaggctacagcc
acacctgagccactctcagatgagg
accta

















8.5.5 A Main Program for Reading DNA and Writing Protein 


Now, one final program for this section. Let's add to the preceding program a translation 
from DNA to protein and print out the protein instead. Notice how short Example 8-3 
is! As you accumulate useful subroutines in our modules, programs get easier and easier 
to write. 


Example 8-3. Read a DNA FASTA file, translate to protein, and format output 


#!/usr/bin/perl
# Read a fasta file and extract the DNA sequence data
# Translate it to protein and print it out in 25-character-long lines

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Initialize variables
my @file_data = (  );
my $dna = '';
my $protein = '';

# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");

# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);

# Translate the DNA to protein
$protein = dna2peptide($dna);

# Print the sequence in lines 25 characters long
print_sequence($protein, 25);

exit;

Here's the output of Example 8-3: 
RWRR_GVLGALGRPPTGLQRRRRMG
PAQ_EYAAWEA_LEAEVVVGAFATA
WDAAEWSVQVRGSLAGVVRECAGSG
DMEGDGSDPEPPDAGEDSKSENGEN
APIYCICRKPDINCFMIGCDNCNEW
FHGDCIRITEKMAKAIREWYCRECR
EKDPKLEIRYRHKKSRERDGNERDS
SEPRDEGGGRKRPVPDPDLQRRAGS
GTGVGAMLARGSASPHKSSPQPLVA
TPSQHHQQQQQQIKRSARMCGECEA
CRRTEDCGHCDFCRDMKKFGGPNKI
RQKCRLRQCQLRARESYKYFPSSLS
PVTPSESLPRPRRPLPTQQQPQPSQ
KLGRIREDEGAVASSTVKEPPEATA
TPEPLSDEDL











8.6 Reading Frames 


The biologist knows that, given a sequence of DNA, it is necessary to examine all six 
reading frames of the DNA to find the coding regions the cell uses to make proteins. 


8.6.1 What Are Reading Frames? 


Very often you won't know where in the DNA you're studying the cell actually begins 
translating the DNA into protein. Only about 1-1.5% of human DNA is in genes, which 
are the parts of DNA used for the translation into proteins. Furthermore, genes very often 
occur in pieces that are spliced together during the transcription/translation process. 


If you don't know where the translation starts, you have to consider the six possible 
reading frames. Since the codons are three bases long, the translation happens in three 
"frames," for instance starting at the first base, or the second, or perhaps the third. (The 
fourth would be the same as starting from the first.) Each starting place gives a different 
series of codons, and, as a result, a different series of amino acids. 


Also, transcription and translation can happen on either strand of the DNA; that is, either 
the DNA sequence, or its reverse complement, might contain DNA code that is actually 
translated. The reverse complement can also be read in any one of three frames. So a total 
of six reading frames have to be considered when looking for coding regions, that part of 
the DNA that encodes proteins. 


It is therefore quite common to examine all six reading frames of a DNA sequence and to 
look at the resulting protein translations for long stretches of amino acids that lack stop 
codons. 


The stop codons are definite breaks in the DNA-to-protein translation process. During 
translation (actually of RNA to protein, but I'm being deliberately informal and vague 
about the biochemistry), if a stop codon is reached, the translation stops, and the growing 
peptide chain grows no more. 


Long stretches of DNA that don't contain any stop codons are called open reading frames 
(ORFs) and are important clues to the presence of a gene in the DNA under study. So 
gene finder programs need to perform the type of reading frame analysis we'll do in this 
chapter. 
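As a rough illustration of looking for long stretches without stops, the sketch below 
splits a translated sequence at each stop and keeps the longest piece. It assumes that 
stops are marked with an underscore, as in the output of Example 8-3; the helper name 
longest_stop_free_stretch is made up, and a real gene finder would consider much more 
than this (start codons, for one). 

```perl
use strict;
use warnings;

# Return the longest stretch of amino acids between stop codons, assuming
# the translation marks each stop with an underscore.
sub longest_stop_free_stretch {
    my($protein) = @_;
    my $longest = '';
    foreach my $stretch (split /_/, $protein) {
        $longest = $stretch if length($stretch) > length($longest);
    }
    return $longest;
}

my $protein = 'RWRR_GVLGALGRPPTGLQRRRRMGPAQ_EYAAWEA_LEAEV';
print longest_stop_free_stretch($protein), "\n";
```

Run on each of the six translations in turn, a helper like this would point to the frame 
with the longest uninterrupted stretch. 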


8.6.2 Translating Reading Frames 


Based on the facts just presented, let's write some code that translates the DNA in all six 
reading frames. 


In the real world, you'd look around for some subroutines that are already written to do 
that task. Given the basic nature of the task (something anyone who studies DNA has to 
do), you'd likely find something. But this is a tutorial, not the real world, so let's soldier 
on. 


This problem doesn't sound too daunting. So, take stock of the subroutines at your 
disposal, think of where you are and how you can get to your destination. 


Looking through the subroutines we've already written, recall dna2peptide. You may 
recall considering adding some arguments to specify starting and end points. Let's do this 
now. 


Remember that although we calculated reverse complements back in Chapter 4, we 
never made a subroutine out of it. So let's start there: 


# revcom 
#
# A subroutine to compute the reverse complement of DNA sequence

sub revcom {

    my($dna) = @_;

    # First reverse the sequence
    # (reverse must be called in scalar context to reverse a string)
    my $revcom = reverse $dna;

    # Next, complement the sequence, dealing with upper and lower case
    # A->T, T->A, C->G, G->C
    $revcom =~ tr/ACGTacgt/TGCAtgca/;

    return $revcom;
}
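
One Perl detail is worth checking when you reverse a sequence: the reverse function reverses the characters of a string only in scalar context. In list context it reverses the order of a list, and a list of one string comes back unchanged. Here is a small sketch with a made-up sequence that shows the difference.

```perl
use strict;
use warnings;

my $dna = 'AAGGTC';    # made-up sequence

# Scalar context: reverse returns the string with its characters reversed.
my $reversed = reverse $dna;

# List context: reverse reverses the order of a list,
# so a one-element list is returned unchanged.
my ($unchanged) = reverse($dna);

# Complement the reversed string to get the reverse complement
(my $revcom = $reversed) =~ tr/ACGTacgt/TGCAtgca/;

print "scalar context: $reversed\n";
print "list context:   $unchanged\n";
print "revcom:         $revcom\n";
```

The parentheses around a variable on the left side of an assignment are what create the list context.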


Now, a little pseudocode to sketch an idea for the subroutine that will translate specific 
ranges of DNA: 


Given DNA sequence 

subroutine translate_frame ( DNA, start, end) {

    return dna2peptide( substr( DNA, start, end - start + 1) )
}


That went well! Luckily, the substr built-in Perl function made it easy to apply the 
desired start and end points, while passing the DNA into the already written dna2peptide 
subroutine. 


Note that the length of the sequence is end-start+1. To give a small example: if you 
start at position 3 and end at position 5, you've got the bases at positions 3, 4, and 5, three 
bases in all, which is exactly what 5 - 3 + 1 equals. 


Dealing with indices like this has to be done carefully, or the code won't work. For many 
programs, this is the worst the mathematics gets. 


Pay attention to the indices! 





You have to decide whether to number positions from 0, which is Perl's way to do it, or 
from 1, so that the first character of the sequence is in position 1, which is the biologist's 
way to do it. Let's do it the biologist's way. The positions will be decreased by one when 
passed to the Perl function substr, which, of course, does it Perl's way. 
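
Here is a minimal sketch of that conversion. The subroutine name subsequence is made up; it takes positions numbered from 1 and subtracts one before calling substr.

```perl
use strict;
use warnings;

# Extract bases $start..$end (numbered from 1, inclusive) from a sequence.
sub subsequence {
    my ($seq, $start, $end) = @_;
    # substr counts from 0, so shift the start; the length is unchanged
    return substr($seq, $start - 1, $end - $start + 1);
}

my $dna = 'ACGTTGCA';    # made-up sequence
print subsequence($dna, 3, 5), "\n";    # the bases at positions 3, 4, and 5
```

Asking for positions 3 through 5 of ACGTTGCA gives GTT, three bases in all.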


The corrected pseudocode looks like this: 


Given DNA sequence 

subroutine translate_frame ( DNA, start, end) {

    # start and end are numbering the sequence from 1 to length

    return dna2peptide( substr( DNA, start - 1, end - start + 1) )
}


The length of the desired sequence doesn't change with the change in indices, since: 

    (end - 1) - (start - 1) + 1 = end - start + 1 


So let's write this subroutine: 

# translate_frame 
#
# A subroutine to translate a frame of DNA

sub translate_frame {

    my($seq, $start, $end) = @_;

    my $protein;

    # To make the subroutine easier to use, you won't need to specify
    # the end point--it will just go to the end of the sequence
    # by default.
    unless($end) {
        $end = length($seq);
    }

    # Finally, calculate and return the translation
    return dna2peptide ( substr ( $seq, $start - 1, $end - $start + 1) );
}
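
If you'd like to try the idea without the BeginPerlBioinfo module at hand, here is a self-contained sketch. The names simple_dna2peptide and simple_translate_frame are stand-ins I've made up, not the book's subroutines. The codon table is built from the standard genetic code, with an underscore for stop codons, and any codon it can't look up becomes an X.

```perl
use strict;
use warnings;

# Build a codon table from the standard genetic code.
# The amino acid string follows codons with bases ordered T, C, A, G;
# '_' marks a stop codon.
my @bases = qw(T C A G);
my $amino_acids =
    'FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %genetic_code;
my $index = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $genetic_code{"$b1$b2$b3"} = substr($amino_acids, $index++, 1);
        }
    }
}

# Hypothetical stand-in for dna2peptide: translates complete codons only
sub simple_dna2peptide {
    my ($dna) = @_;
    my $peptide = '';
    for (my $pos = 0; $pos + 3 <= length $dna; $pos += 3) {
        my $codon = uc substr($dna, $pos, 3);
        $peptide .= $genetic_code{$codon} // 'X';    # X for unknown codons
    }
    return $peptide;
}

# Same indexing as translate_frame: positions are numbered from 1
sub simple_translate_frame {
    my ($seq, $start, $end) = @_;
    $end = length $seq unless $end;
    return simple_dna2peptide(substr($seq, $start - 1, $end - $start + 1));
}

my $dna = 'ATGGCCTAA';    # made-up sequence
for my $frame (1 .. 3) {
    print "Frame $frame: ", simple_translate_frame($dna, $frame), "\n";
}
```

Frames 4 through 6 would use the same calls on the reverse complement of the sequence.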


Example 8-4 translates the DNA in all six reading frames. 


Example 8-4. Translate a DNA sequence in all six reading frames 


#!/usr/bin/perl 
# Translate a DNA sequence in all six reading frames 

use strict; 
use warnings; 
use BeginPerlBioinfo;     # see Chapter 6 about this module 

# Initialize variables 
my @file_data = (  ); 
my $dna = ''; 
my $revcom = ''; 
my $protein = ''; 

# Read in the contents of the file "sample.dna" 
@file_data = get_file_data("sample.dna"); 

# Extract the sequence data from the contents of the file "sample.dna" 
$dna = extract_sequence_from_fasta_data(@file_data); 

# Translate the DNA to protein in six reading frames 
#   and print the protein in lines 70 characters long 

print "\n -------Reading Frame 1--------\n\n"; 
$protein = translate_frame($dna, 1); 
print_sequence($protein, 70); 

print "\n -------Reading Frame 2--------\n\n"; 
$protein = translate_frame($dna, 2); 
print_sequence($protein, 70); 

print "\n -------Reading Frame 3--------\n\n"; 
$protein = translate_frame($dna, 3); 
print_sequence($protein, 70); 

# Calculate reverse complement 
$revcom = revcom($dna); 

print "\n -------Reading Frame 4--------\n\n"; 
$protein = translate_frame($revcom, 1); 
print_sequence($protein, 70); 

print "\n -------Reading Frame 5--------\n\n"; 
$protein = translate_frame($revcom, 2); 
print_sequence($protein, 70); 

print "\n -------Reading Frame 6--------\n\n"; 
$protein = translate_frame($revcom, 3); 
print_sequence($protein, 70); 

exit; 


Here's the output of Example 8-4: 


 -------Reading Frame 1-------- 


RWRR_GVLGALGRPPTGLORRRRMGPAQ EYAAWEA LEAEVVVGAFATAWDAAEWSVQ 
VRGSLAGVVRE 

CAGSGDMEGDGS DPEPPDAGEDSKSENGENAPTYCICRKPDINCFMIGCDNCNEWFHGD 
CIRITEKMAKA 
TREWYCRECREKDPKLETRYRHKKSRERDGNERDSSEPRDEGGGRKRPVPDPDLORRAG 
SGTGVGAMLAR 
GSAS PHKSS POPLVAT PSQHHOQOQOOO TKRSARMCGECEACRRTEDCGHCDFCRDMKKF 
GGPNKIROKCR 
LROCQLRARESYKYFPSSLSPVTPSESLPRPRRPLPTOQOPOPSQKLGRIREDEGAVAS 
STVKEPPEATA 

TPEPLSDEDL 

 -------Reading Frame 2-------- 




















DGGAEGSWGL_AGHLLVCSGDDAWGLRNRSTLPGRRD KRK_ LWAPLQPPGTPPSGLCR 
FAGRWRGS GS 

APGAE IWREMVOTOSLOMPGRTASPRMGRMRPSTASAANRTSTAS SGVTTAMSGSMGT 
ASGSLRRWPRP 
SGSGTVGSAERKTPS_ RFAIGTRSHGSGMAMSGTAVSPGMRVEGARGLSLIQTCSAGQG 
QGOGLGPCLLG 
ALLRPTNPLRS PWWPHPAS I TSSSSSRSNGQPACVVSVRHVGALRTVVTIVISVGT_RSS 
GAPTRSGRSAG 

CASASCGPGNRTSTSLPRSHQ RPQSPCQGPAGHCPPNSSHSHHRS_GASVKMRGQWRH 
QQSRSLLRLOP 
HLSHSQMRT 

 -------Reading Frame 3-------- 


MAALRGLGGSRPAT YWFAAETTHGACAIGVRCLGGVTRSGSSCGRLCNRLGRRRVVCAG 
SRVAGGGREGV 

RRERRY GGRWFRPRASRCRGGQQVREWGECAHLLHLPOTGHQLLHDRV_QLQ VVPWGL 
HPDH_ EDGQGH 

PGVVLSGVORERPQARDSLSAQEVTGAGWOQ_AGQQ APG GWRAQEACP SRPAAPGRV 
RDRGWGHACSG 

LCFAPQILSAALGGHTQPAS PAAAAADOTVSPHVW_V_GMSAH GLWSL_FLSGHEEVR 
GPQQDPAEVPA 
APVPAAGPGIVOQVLPFLALTSDALRVPAKAPPATAHPTAATAITEVRAHP R_ GGSGVI 
NSQGAS_GYSH 

T ATLR GP 

 -------Reading Frame 4-------- 





_VLI_EWLRCGCSLRRLLDC _ 

_RHCPLIFTDAP LL WLWLLLGGQWPAGPWQGL GRHW_ ERGREVLVR 
FPGPOLALAQPALLPDLVGAPELLHVPTEITVTTVLSAPTCLTLTTHAG PFDLLLLLL 
VMLAGCGHOGL 
RRGFVGRSRAPSKHGPNPCP_ PCPALOQVWIRDRPLAPSTLIPGLTAVPLIATPLP LLV 
PIANL LGVFL 
SALPTVPLPDGLGHLLSDPDAVPMEPLIAVVTPDHEAVDVRFAADAVDGRILPILGLAV 
LPGIWRLWV_T 

ISLHISAPGALPHDPRORPANLHRPLGGVPGGCKGAHNYFRF SRLPGSVLLLRRPHAS 
SPLOTSRWPA _ 

SPODPSAPPS 

 -------Reading Frame 5-------- 





RSSSESGSGVAVASGGSLTVDDATAPSSSRMRPNFCDGCGCCWVGSGRRGLGRDSEGVT 
GESEEGKYLYD 

SRARSWHWRSRHFCRILLGPPNFFMSROKSQ PQSSVRRHASHSPHMRADRLICCCCCW 
_CWLGVATKGC 
GEDLWGEAEPRASMAPTPVPDPARRCRSGSGTGLLRPPPSSRGSLLSRSLPSRSRDFLC 
R_RISSLGSFS 
LHSROYHSRMALATFSVIRMQS PWNHSLOLSHPIMKQLMSGLROMQ_MGAFSPFSDLLS 
SPASGGSGSEP 

SPSITSPLPAHSLTTPASDPRTCT DHSAASQAVAKAPTTT SASSHASQAAYSYCAGPMRR 
LRCKPVGGRPR 

APKTPORRH 

 -------Reading Frame 6-------- 

GPHLRVAQVWL PQEAP LLMTPLPPHLHGCALTSVMAVAAVGWAVAGGALAGTLRASL 
VRARKGSTCTI 

PGPAAGTGAAGTSAGSCWGPRTSSCPDRNHSDHS POCADMPHTHHTCGLTV_SAAAAAG 
DAGWVWPPRAA 

ERICGAKOS PEOAWPOPLSLTLPGAAGLDOGOASCALHPHPGAHCCPAHCHPAPVTSCA 
DSESLAWGLSL 

CTPDSTTPGWPWPSSQ SGCSPHGTTHCSCHTRS SS CPVCGRCSRWAHSPHSRTCCP 
PRHLEALGLNH 
LPPYLRSRRTPSRPPPATREPAOTTRRRPRRLORRPOLLPLLVTPPRORTPIAQAPCVV 
SAANQ_ VAGLE 

PPRPLSAAT 





8.7 Exercises 


Exercise 8.1 


Write a subroutine that checks a string and returns true if it's a DNA sequence. 
Write another that checks for protein sequence data. 


Exercise 8.2 


Write a program that can search by name for a gene in an unsorted array. 


Exercise 8.3 


Write a program that can search by name for a gene in a sorted array; use the Perl 
sort function to sort an array. For extra credit: write a binary search subroutine to 
do the searching. 


Exercise 8.4 


Write a subroutine that inserts an element into a sorted array. Hint: use the splice 
Perl function to insert the element, as shown in Chapter 4. 


Exercise 8.5 


Write a program that searches by name for a gene in a hash. Get the genes from 
your own work or try downloading a list of all genes for a given organism from 
www.ncbi.nlm.nih.gov or one of the web sites given in Appendix A. Make 
a hash of all the genes (key=name, value=gene ID or sequence). Hint: you may 
have to write a short Perl program to reformat the list of genes you start with to 
make it easy to populate the Perl hash. 


Exercise 8.6 


Write a subroutine that checks an array of data and returns true if it's in FASTA format. 
Note that FASTA expects the standard IUB/IUPAC amino acid and nucleic acid codes, 
plus the dash (-) that represents a gap of unknown length. Also, the asterisk (*) represents 
a stop codon for amino acids. Be careful using an asterisk in regular expressions; use a \* 
to escape it to match an actual asterisk. 


The remaining problems deal with the effect of mutations in DNA on the proteins they 
encode. They combine the subject of randomization and mutations from Chapter 7 with 
the subject of the genetic code from this chapter. 


Exercise 8.7 


For each codon, make note of what effect single nucleotide mutations have on the 
codon: does the same amino acid result, or does the codon now encode a different 
amino acid? Which one? Write a subroutine that, given a codon, returns a list of 
all the amino acids that may result from any single mutation in the codon. 


Exercise 8.8 


Write a subroutine that, given an amino acid, randomly changes it to one of the 
amino acids calculated in Exercise 8.7. 


Exercise 8.9 


Write a program that randomly mutates the amino acids in a protein but restricts 
the possibilities to those that can occur due to a single mutation in the original 
codons, as in Exercises 8.7 and 8.8. 


Exercise 8.10 


Some codons are more likely than others to occur in random DNA. For instance, 
there are 6 of the 64 possible codons that code for the amino acid serine, but only 
2 of the 64 code for phenylalanine. Write a subroutine that, given an amino acid, 
returns the probability that it's coded by a randomly generated codon (see 
Chapter 7). 


Exercise 8.11 


Write a subroutine that takes as arguments an amino acid; a position 1, 2, or 3; 
and a nucleotide. It then takes each codon that encodes the specified amino acid 
(there may be from one to six such codons), and mutates it at the specified 
position to the specified nucleotide. Finally, it returns the set of amino acids that 
are encoded by the mutated codons. 


Exercise 8.12 


Write a program that, given two amino acids, returns the probability that a single 
mutation in their underlying (but unspecified) codons results in the codon of one 
amino acid mutating to the codon of the other amino acid. 


Chapter 9. Restriction Maps and Regular Expressions 


In this chapter, I'll give an overview of Perl regular expressions and Perl operators, two 
essential features of the language we've been using all along. We'll also investigate the 
programming of a standard, fundamental molecular-biology technique: the discovery of a 
restriction map for a sequence. Restriction digests were one of the original ways to 
"fingerprint" DNA; this can now be simulated on the computer. 


Restriction maps and their associated restriction digests are common calculations in the 
laboratory and are provided by several software packages. They are essential tools in the 
planning of cloning experiments; they can be used to insert a desired stretch of DNA into 
a cloning vector, for instance. Restriction maps also find application in sequencing 
projects, for instance in shotgun or directed sequencing. 


9.1 Regular Expressions 


We've been dealing with regular expressions for a while now. This section fills in some 
background and ties together the somewhat scattered discussions of regular expressions 
from earlier parts of the book. 


Regular expressions are interesting, important, and rich in capabilities. Jeffrey Friedl's 
book Mastering Regular Expressions (O'Reilly) is entirely devoted to them. Perl 
makes particularly good use of regular expressions, and the Perl documentation explains 
them well. Regular expressions are useful when programming with biological data such 
as sequence, or with GenBank, PDB, and BLAST files. 


Regular expressions are ways of representing—and searching for—many strings with one 
string. Although they are not strictly the same thing, it's useful to think of regular 
expressions as a kind of highly developed set of wildcards. The special characters in 
regular expressions are more properly known as metacharacters. 


Most people are familiar with wildcards, which are found in search engines or in the 
game of poker. You might find the reference to every word that starts with biolog by 
typing biolog*, for instance. Or you may find yourself holding five aces. (Different 
situations may use different wildcards. Perl regular expressions use * to mean "0 or more 
of the preceding item," not "followed by anything" as in the wildcard example just given.) 


In computer science, these kinds of wildcards or metacharacters have an important 
history, both practically and theoretically. The asterisk character in particular is called the 
Kleene closure after the eminent logician who invented it. As a nod to the theory, I'll 
mention there is a simple model of a computer, less powerful than a Turing machine, that 
can deal with exactly the same kinds of languages that can be described by regular 
expressions. This machine model is called a finite state automaton. But enough theory for 
now. 


We've already seen many examples that use regular expressions to find things in a DNA 
or protein sequence. Here I'll talk briefly about the fundamental ideas behind regular 
expressions as an introduction to some terminology. There is a useful summary of 
regular-expression features in Appendix B. Finally, we'll see how to learn more about 
them in the Perl documentation. 


So let's start with a practical example that should be familiar by now to those who have 
been reading this text sequentially: using character classes to search DNA. Let's say there 
is a small motif you'd like to find in your library of DNA that is six basepairs long: CT 
followed by C or G or T followed by ACG. The third nucleotide in this motif is never A, 
but it can be C, G, or T. You can make a regular expression by letting the character class 
[CGT] stand for the variable position. The motif can then be represented by a regular 
expression that looks like this: CT[CGT]ACG. This is a motif that is six base pairs long 
with a C, G, or T in the third position. If your DNA is in a scalar variable $dna, you 
can test for the presence of the motif by using the regular expression as a conditional test 
in a pattern-matching statement, like so: 


if( $dna =~ /CT[CGT]ACG/ ) { 
    print "I found the motif!!\n"; 
} 
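
If you want to know where the motif occurs, and not just whether it does, a loop with the /g modifier visits each match in turn. The following is a sketch with a made-up sequence and a made-up subroutine name, motif_positions. It reports positions numbered from 1 and finds non-overlapping matches only.

```perl
use strict;
use warnings;

# Return the 1-based start position of every non-overlapping match.
sub motif_positions {
    my ($dna, $regexp) = @_;
    my @positions;
    while ($dna =~ /$regexp/g) {
        # $-[0] holds the 0-based offset where the last match started
        push @positions, $-[0] + 1;
    }
    return @positions;
}

my $dna = 'AACTGACGTTCTAACGCTTACGA';    # made-up sequence
my @found = motif_positions($dna, 'CT[CGT]ACG');
print "Motif found at: @found\n";
```

In this sequence the stretch CTAACG is skipped, because its third base is an A.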


Regular expressions are based on three fundamental ideas: 


Repetition (or closure) 


The asterisk (*), also called Kleene closure or star, indicates 0 or more repetitions 
of the character just before it. For example, abc* matches any of these strings: 
ab, abc, abcc, abccc, abcccc, and so on. The regular expression matches an 
infinite number of strings. 


Alternation 


In Perl, the pattern (a|b) (read: a or b) matches the string a or the string b. 


Concatenation 


This is a real obvious one. In Perl, the string ab means the character a followed by 
(concatenated with) the character b. 


The use of parentheses for grouping is important: they are also metacharacters. So, for 
instance, the string (abc|def)z*x matches such strings as abcx, abczx, abczzx, 
defx, defzx, defzzzzzx, and so on. In English, it matches either abc or def 
followed by zero or more z's, and ending with an x. This example combines the ideas of 
grouping, alternation, closure, and concatenation. The real power of regular expressions 
is seen in this combining of the three fundamental ideas. 
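
You can check a combined pattern like this one against a handful of strings. In this sketch I've added the anchors ^ and $, which the text hasn't introduced yet, so that the whole string has to fit the pattern; the test strings are made up.

```perl
use strict;
use warnings;

# Anchored with ^ and $ so the whole string must fit the pattern
my $pattern = qr/^(abc|def)z*x$/;

for my $string (qw(abcx abczx defzzzzzx abcz abdefx)) {
    my $verdict = $string =~ $pattern ? 'matches' : 'does not match';
    print "$string $verdict\n";
}
```

The last two strings fail: abcz has no final x, and abdefx doesn't begin with abc or def.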


Perl has many regular-expression features. They are basically shortcuts for the three 
fundamental ideas we've just seen: repetition, alternation, and concatenation. For 
instance, the character class shown earlier can be written using alternation as (C|G|T). 
Another common feature is the period, which can stand for any character, except a 
newline. So ACG.*GCA stands for any DNA that starts with ACG and ends with GCA. In 
English, this reads as: ACG followed by 0 or more characters followed by GCA. 
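
Here is a short sketch, using made-up test strings, that tries the character class, the equivalent alternation, and the period with a star side by side.

```perl
use strict;
use warnings;

my @sequences = qw(CTGACG CTAACG ACGTTGCA ACGGCA);    # made-up test strings

# A character class and the equivalent alternation select the same strings
my @by_class       = grep { /CT[CGT]ACG/ }   @sequences;
my @by_alternation = grep { /CT(C|G|T)ACG/ } @sequences;

# The period matches any character except a newline;
# .* allows zero or more of them between ACG and GCA
my @acg_to_gca = grep { /ACG.*GCA/ } @sequences;

print "Character class: @by_class\n";
print "Alternation:     @by_alternation\n";
print "ACG.*GCA:        @acg_to_gca\n";
```

Note that ACGGCA matches the last pattern, since zero characters between ACG and GCA are allowed.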


In Perl, regular expressions are usually enclosed within forward slashes and are used as 
pattern-matching specifiers. Check the documentation (or Appendix B) for m//, 
which includes some options that affect the behavior of the regular expressions. Regular 
expressions are also used in many of Perl's built-in commands, as you will see. 


The Perl documentation is essential: start with the perlre section of the 
Perl manual at http://www.perldoc.com/perl5.6/pod/perlre.html#top. 


9.2 Restriction Maps and Restriction Enzymes 


One of the great discoveries in molecular biology, which paved the way for the current 
golden age in biological research, was the discovery of restriction enzymes. For the 
nonbiologist, and to help set up the programming material that follows, here's a short 
overview. 


9.2.1 Background 


Restriction enzymes are proteins that cut DNA at short, specific sequences; for example, 
the popular restriction enzymes EcoRI and HindIII are widely used in the lab. EcoRI cuts 
where it finds GAATTC, between the G and A. Actually, it cuts both complementary 
strands, leaving an overhang on each end. These "sticky ends" of a few bases in single 
strands make it possible for the fragments to re-form, making possible the insertion of 
DNA into vectors for cloning and sequencing, for instance. HindIII cuts at AAGCTT and 
cuts between the As. Some restriction enzymes cut in the middle and result in "blunt 
ends" with no overhang. About 1,000 restriction enzymes are known. 
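
To see what a cut between the G and the A looks like in a program, here is a minimal sketch that splits a made-up sequence at every EcoRI site. It models one strand only and ignores the overhangs. The subroutine name ecori_fragments is hypothetical, and the pattern uses lookbehind and lookahead assertions, which match a position without consuming any bases.

```perl
use strict;
use warnings;

# Cut a sequence wherever EcoRI would: between the G and the A of GAATTC.
# (?<=G) requires a G just before the cut, (?=AATTC) requires AATTC just after.
sub ecori_fragments {
    my ($dna) = @_;
    return split /(?<=G)(?=AATTC)/, $dna;
}

my $dna = 'AAGAATTCTTGAATTCGG';    # made-up sequence with two EcoRI sites
my @fragments = ecori_fragments($dna);
print join(' | ', @fragments), "\n";
```

Joining the fragments back together gives the original sequence, since nothing is removed at a cut.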


If you look at the reverse complement of the EcoRI recognition site, you see it's 
GAATTC, the same sequence. This is a biological version of a palindrome, a word that 
reads the same in reverse. Many restriction sites are palindromes. 
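
A few lines of Perl can test whether a site is a palindrome in this biological sense. This is a sketch; the subroutine name is_palindromic_site is made up, and it handles only the plain bases A, C, G, and T.

```perl
use strict;
use warnings;

# Return true if a site equals its own reverse complement.
sub is_palindromic_site {
    my ($site) = @_;
    my $revcom = reverse $site;            # scalar context reverses the string
    $revcom =~ tr/ACGTacgt/TGCAtgca/;      # then complement each base
    return uc($site) eq uc($revcom);
}

for my $site (qw(GAATTC AAGCTT CCGCTC)) {
    my $answer = is_palindromic_site($site) ? 'a palindrome' : 'not a palindrome';
    print "$site is $answer\n";
}
```

GAATTC and AAGCTT pass the test; CCGCTC, whose reverse complement is GAGCGG, doesn't.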


Computing restriction maps is a common and practical bioinformatics 
calculation in the laboratory. Restriction maps are computed to plan 
experiments, to find the best way to cut DNA to insert a gene, to make 
a site-specific mutation, or for several other applications of 
recombinant DNA techniques. By computing first, the laboratory 
scientist saves considerably on the necessary trial-and-error at the 
laboratory bench. Look for more about restriction enzymes at 
http://www.neb.com/rebase/rebase.html. 


We'll now write a program that does something useful in the lab: it will look for 
restriction sites in a sequence of DNA and report back with a restriction map of 
exactly where in the DNA the restriction enzymes' sites appear. 


9.2.2 Planning the Program 


Back in Chapter 5, you saw how to look for regular expressions in text. So you've an 
idea of how to find motifs in sequences with Perl. Now let's think about how to use those 
techniques to create restriction maps. Here are some questions to ask: 


Where do I find restriction enzyme data? 


Restriction enzyme data can be found at the Restriction Enzyme Database 
(REBASE), which is on the Web at http://www.neb.com/rebase/rebase.html. 


How do I represent restriction enzymes in regular expressions? 


Exploring that site, you'll see that restriction enzymes are represented in their own 
language. We'll try to translate that language into the language of regular 
expressions. 


How do I store restriction enzyme data? 


There are about 1,000 restriction enzymes with names and definitions. This makes 
them candidates for the fast key-value type of lookup hashes provide. When you 
write a real application, say for the Web, it's a good idea to create a DBM file to 
store the information, ready to use when a program needs a lookup. I will cover 
DBM files in Chapter 10; here, I'll just demonstrate the principle. We'll keep 
only a few restriction enzyme definitions in the program. 


How do I accept queries from the user? 


You can ask for a restriction enzyme name, or you can allow the user to type in a 
regular expression directly. We'll do the first. Also, you want to let the user 
specify which sequence to use. Again, to simplify matters, you'll just read in the 
data from a sample DNA file. 


How do I report back the restriction map to the user? 


This is an important question. The simplest way is to generate a list of positions 
with the names of the restriction enzymes found there. This is useful for further 
processing, as it presents the information very simply. 

But what if you don't want to do further processing; you just want to 
communicate the restriction map to the user? Then, perhaps it'd be more useful to 
present a graphical display, perhaps print out the sequence with a line above it 
that flags the presence of the enzymes. 

There are lots of fancy bells and whistles you can use, but let's do it the simple 
way for now and output a list. 


So, the plan is to write a program that includes restriction enzyme data translated into 
regular expressions and stored in a hash, with the restriction enzyme names as the keys. 
DNA sequence data will be read from the file, and the user will be prompted for names 
of restriction enzymes. The appropriate regular expression will be retrieved from the hash, 
and we'll search for all instances of that regular expression, plus their locations. Finally, 
the list of locations found will be returned. 
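
Before filling in the details, here is a bare-bones sketch of that plan. It is not the program developed in this chapter: the two-entry hash stands in for the REBASE data, the sequence is made up, and the subroutine name restriction_positions is hypothetical. The user prompt is left out so the sketch runs on its own.

```perl
use strict;
use warnings;

# A tiny stand-in for the enzyme table: name => regular expression
my %enzyme_regexp = (
    EcoRI   => 'GAATTC',
    HindIII => 'AAGCTT',
);

# Return the 1-based start positions of every match of the enzyme's site.
# An unknown enzyme name gives an empty list.
sub restriction_positions {
    my ($enzyme, $dna) = @_;
    my $regexp = $enzyme_regexp{$enzyme};
    return () unless defined $regexp;
    my @positions;
    while ($dna =~ /$regexp/ig) {    # /i so lowercase sequence also matches
        push @positions, $-[0] + 1;
    }
    return @positions;
}

my $dna = 'ccgaattcggaagcttaagaattc';    # made-up sequence
for my $enzyme (sort keys %enzyme_regexp) {
    my @positions = restriction_positions($enzyme, $dna);
    print "$enzyme: @positions\n";
}
```

The output is the simple list format discussed above: each enzyme name followed by the positions where its site starts.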


9.2.3 Restriction Enzyme Data 


The restriction enzyme data is available in a variety of formats, as a visit to the REBASE 
web site will show you. After looking around, you decide to get the information from the 
bionet file, which has a fairly simple layout. Here's the header and a few restriction 
enzymes from that file: 

REBASE version 104                                              bionet.104 

    REBASE, The Restriction Enzyme Database   http://rebase.neb.com 
    Copyright (c)  Dr. Richard J. Roberts, 2001.   All rights reserved. 

Rich Roberts                                                    Mar 30 2001 

AaaI (XmaIII)                     C^GGCCG 
AacI (BamHI)                      GGATCC 
AaeI (BamHI)                      GGATCC 
AagI (ClaI)                       AT^CGAT 
AaqI (ApaLI)                      GTGCAC 
AarI                              CACCTGCNNNN^ 
AarI                              ^NNNNNNNNGCAGGTG 
AatI (StuI)                       AGG^CCT 
AatII                             GACGT^C 
AauI (Bsp1407I)                   T^GTACA 
AbaI (BclI)                       T^GATCA 
AbeI (BbvCI)                      CC^TCAGC 
AbeI (BbvCI)                      GC^TGAGG 
AbrI (XhoI)                       C^TCGAG 
AcaI (AsuII)                      TTCGAA 
AcaII (BamHI)                     GGATCC 
AcaIII (MstI)                     TGCGCA 
AcaIV (HaeIII)                    GGCC 
AccI                              GT^MKAC 
AccII (FnuDII)                    CG^CG 
AccIII (BspMII)                   T^CCGGA 
Acc16I (MstI)                     TGC^GCA 
Acc36I (BspMI)                    ACCTGCNNNN^ 
Acc36I (BspMI)                    ^NNNNNNNNGCAGGT 
Acc38I (EcoRII)                   CCWGG 
Acc65I (KpnI)                     G^GTACC 
Acc113I (ScaI)                    AGT^ACT 
AccB1I (HgiCI)                    G^GYRCC 
AccB2I (HaeII)                    RGCGC^Y 
AccB7I (PflMI)                    CCANNNN^NTGG 
AccBSI (BsrBI)                    CCG^CTC 
AccBSI (BsrBI)                    GAG^CGG 
AccEBI (BamHI)                    G^GATCC 
AceI (TseI)                       G^CWGC 
AceII (NheI)                      GCTAG^C 
AceIII                            CAGCTCNNNNNNN^ 
AceIII                            ^NNNNNNNNNNNGAGCTG 
AciI                              C^CGC 
AciI                              G^CGG 
AclI                              AA^CGTT 
AclNI (SpeI)                      A^CTAGT 
AclWI (BinI)                      GGATCNNNN^ 


Your first task is to read this file and get the names and the recognition site (or restriction 
site) for each enzyme. To simplify matters for now, simply discard the parenthesized 
enzyme names. 


How can this data be read? 


Discard header lines 

For each data line: 

    remove parenthesized names, for simplicity's sake 

    get and store the name and the recognition site 

    Translate the recognition sites to regular expressions 
        --but keep the recognition site, for printing out results 
} 

return the names, recognition sites, and the regular expressions 





This is high-level undetailed pseudocode, so let's refine and expand it. (Notice that the 
curly brace isn't properly matched. That's okay, because there are no syntax rules for 
pseudocode; do whatever works for you!) Here's some pseudocode that discards the 
header lines: 


foreach line 
    if /Rich Roberts/ 
        break out of the foreach loop 
} 


This is based on the format of the file, in which the string you're looking for is the last 
text before the data lines start. (Of course, if the format of the file should change, this 
might no longer work.) 
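
One way to turn that pseudocode into Perl is sketched below. It isn't the approach the chapter ends up using; it keeps a flag that says whether the loop is still inside the header. The subroutine name skip_header is made up, and the sample lines are abbreviated from the file excerpt above.

```perl
use strict;
use warnings;

# Return only the lines that follow the line containing the marker text.
sub skip_header {
    my ($marker, @lines) = @_;
    my $in_header = 1;
    my @data;
    foreach my $line (@lines) {
        if ($in_header) {
            # The marker line itself is the last header line
            $in_header = 0 if $line =~ /\Q$marker\E/;
            next;
        }
        push @data, $line;
    }
    return @data;
}

my @lines = (
    'REBASE version 104',
    'Rich Roberts     Mar 30 2001',
    'AatII    GACGT^C',
    'AclI     AA^CGTT',
);
print "$_\n" for skip_header('Rich Roberts', @lines);
```

If the marker never appears, the sketch returns no data lines at all, which is one way to notice that the file format has changed.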


Now let's further expand the pseudocode, thinking how to do the tasks involved: 


# Discard header lines 
# This keeps reading lines, up to a line containing "Rich Roberts" 
foreach line 
    if /Rich Roberts/ 
        break out of the foreach loop 
} 

For each data line: 

    # Split the two or three (if there's a parenthesized name) fields 
    @fields = split( " ", $_); 

    # Get and store the name and the recognition site 
    $name = shift @fields; 

    $site = pop @fields; 

    # Translate the recognition sites to regular expressions 
        --but keep the recognition site, for printing out results 
} 

return the names, recognition sites, and the regular expressions 




This doesn't do the translation yet, but let's look at what you've done. 


First, you want to extract the name and recognition site data from a string. The most 
common way to separate words in a line of text in Perl, especially if the string is nicely 
formatted, is with the Perl built-in function split. 


If a line has two or three fields that are separated from each other by whitespace, 
you can get them into an array with the following simple call to split 
(which acts on the line as stored in the special variable $_): 


@fields = split(" "); 

The @fields array may have two or three elements depending on whether there was a 
parenthesized alternate enzyme named. But you always want the first and the last 
elements: 

$name = shift @fields; 

$site = pop @fields; 
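
Here is a small sketch that puts those three statements to work on two lines taken from the file excerpt, one with a parenthesized name and one without. The subroutine name name_and_site is made up.

```perl
use strict;
use warnings;

# Return the enzyme name (first field) and recognition site (last field).
sub name_and_site {
    my ($line) = @_;
    # split " " splits on runs of whitespace and ignores leading whitespace
    my @fields = split " ", $line;
    my $name = shift @fields;
    my $site = pop @fields;
    return ($name, $site);
}

my @lines = (
    "AatII                             GACGT^C\n",
    "AccEBI (BamHI)                    G^GATCC\n",
);
for my $line (@lines) {
    my ($name, $site) = name_and_site($line);
    print "$name cuts at $site\n";
}
```

The middle field, when there is one, is simply left behind in the array.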


You now have the problem of translating the recognition site to a regular expression. 


Looking over the recognition sites and having read the documentation on REBASE you 
found on its web site, you know that the cut site is represented by the caret (^). This 
doesn't help make a regular expression that finds the site in sequence, so you should 
remove it (see Exercise 9.6 in Section 9.4). 


Also notice that the bases given in the recognition sites are not just the bases A, C, G, and 
T, but they also use the more extended alphabet presented in Table 4-1. These 
additional letters include a letter for every possible group of two, three, or four bases. 
They're really like abbreviations for character classes in that respect. Aha! Let's write a 
subroutine that substitutes character classes for these codes, and then we'll have our 
regular expression. 


Of course, REBASE uses them, because a given restriction enzyme might well match a 
few different recognition sites. 


Example 9-1 is a subroutine that, given a string, translates these codes into character 
classes. 


Example 9-1. Translate IUB ambiguity codes to regular expressions 


# IUB_to_regexp 
# 
# A subroutine that, given a sequence with IUB ambiguity codes, 
# outputs a translation with IUB codes changed to regular expressions 
# 
# These are the IUB ambiguity codes 
# (Eur. J. Biochem. 150: 1-5, 1985): 
# R = G or A 
# Y = C or T 
# M = A or C 
# K = G or T 
# S = G or C 
# W = A or T 
# B = not A (C or G or T) 
# D = not C (A or G or T) 
# H = not G (A or C or T) 
# V = not T (A or C or G) 
# N = A or C or G or T 

sub IUB_to_regexp { 

    my($iub) = @_; 

    my $regular_expression = ''; 

    my %iub2character_class = ( 

        A => 'A', 
        C => 'C', 
        G => 'G', 
        T => 'T', 
        R => '[GA]', 
        Y => '[CT]', 
        M => '[AC]', 
        K => '[GT]', 
        S => '[GC]', 
        W => '[AT]', 
        B => '[CGT]', 
        D => '[AGT]', 
        H => '[ACT]', 
        V => '[ACG]', 
        N => '[ACGT]', 
    ); 

    # Remove the ^ signs from the recognition sites 
    $iub =~ s/\^//g; 

    # Translate each character in the iub sequence 
    for ( my $i = 0 ; $i < length($iub) ; ++$i ) { 
        $regular_expression 
          .= $iub2character_class{substr($iub, $i, 1)}; 
    } 

    return $regular_expression; 
} 
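
For comparison, here is a compact variant of the same idea that you can run on its own. It is a sketch, not a replacement for Example 9-1: the subroutine name iub_to_regexp_compact is made up, and instead of walking through the string one character at a time, it uses a single substitution to replace only the ambiguity codes.

```perl
use strict;
use warnings;

# Only the ambiguity codes need translating; A, C, G, and T stand for themselves
my %iub_class = (
    R => '[GA]',  Y => '[CT]',  M => '[AC]',  K => '[GT]',
    S => '[GC]',  W => '[AT]',  B => '[CGT]', D => '[AGT]',
    H => '[ACT]', V => '[ACG]', N => '[ACGT]',
);

# Hypothetical compact variant: substitute each ambiguity code in place.
sub iub_to_regexp_compact {
    my ($site) = @_;
    $site =~ s/\^//g;                               # drop the cut-site marker
    $site =~ s/([RYMKSWBDHVN])/$iub_class{$1}/g;    # expand ambiguity codes
    return $site;
}

my $regexp = iub_to_regexp_compact('GT^MKAC');      # the AccI site
print "$regexp\n";
print "GTAGAC matches\n" if 'GTAGAC' =~ /^$regexp$/;
```

The AccI site GT^MKAC becomes GT[AC][GT]AC, which matches GTAGAC among others.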


It seems you're almost ready to write a subroutine to get the data from the REBASE 
datafile. But there's one important item you haven't addressed: what exactly is the data 
you want to return? 




You plan to return three data items per line of the original REBASE file: the enzyme 
name, the recognition site, and the regular expression. This doesn't fit easily into a hash. 
You can return an array that stores these three data items in three consecutive slots. This 
can work: to read the data, you'd have to read groups of three items from the array. It's 
doable but might make lookup a little difficult. As you get into more advanced Perl, 
you'll find that you can create your own complex data structures. 


Since you've learned about split, maybe you can have a hash in which the key is the 
enzyme name, and the value is a string with the recognition site and the regular 
expression separated by whitespace. Then you can look up the data fast and just extract 
the desired values using split. Example 9-2 shows this method. 
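
The lookup side of this scheme is short. In the following sketch the hash is filled in by hand with two entries from the file excerpt, and the subroutine name site_and_regexp is made up.

```perl
use strict;
use warnings;

# key = enzyme name, value = "recognition_site regular_expression"
my %rebase_hash = (
    AccI  => 'GT^MKAC GT[AC][GT]AC',
    AatII => 'GACGT^C GACGTC',
);

# Look up an enzyme and recover the two whitespace-separated fields.
# An unknown enzyme name gives an empty list.
sub site_and_regexp {
    my ($name) = @_;
    return () unless exists $rebase_hash{$name};
    return split " ", $rebase_hash{$name};
}

my ($site, $regexp) = site_and_regexp('AccI');
print "AccI: site $site, regular expression $regexp\n";
```

This works because neither the recognition site nor the regular expression contains any whitespace.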


Example 9-2. Subroutine to parse a REBASE datafile 


# parseREBASE--Parse REBASE bionet file 
# 
# A subroutine to return a hash where 
#    key   = restriction enzyme name 
#    value = whitespace-separated recognition site and regular expression 

sub parseREBASE { 

    my($rebasefile) = @_; 

    use strict; 
    use warnings; 
    use BeginPerlBioinfo;     # see Chapter 6 about this module 

    # Declare variables 
    my @rebasefile = (  ); 
    my %rebase_hash = (  ); 
    my $name; 
    my $site; 
    my $regexp; 

    # Read in the REBASE file 
    @rebasefile = get_file_data($rebasefile); 

    foreach ( @rebasefile ) { 

        # Discard header lines 
        ( 1 .. /Rich Roberts/ ) and next; 

        # Discard blank lines 
        /^\s*$/ and next; 

        # Split the two (or three if includes parenthesized name) fields 
        my @fields = split( " ", $_); 

        # Get and store the name and the recognition site 

        # Remove parenthesized names, for simplicity's sake, 
        # by not saving the middle field, if any, 
        # just the first and last 
        $name = shift @fields; 

        $site = pop @fields; 

        # Translate the recognition sites to regular expressions 
        $regexp = IUB_to_regexp($site); 

        # Store the data into the hash 
        $rebase_hash{$name} = "$site $regexp"; 
    } 

    # Return the hash containing the reformatted REBASE data 
    return %rebase_hash; 
} 


This parseREBASE subroutine does quite a lot. Is there, however, too much in one 
subroutine; should it be rewritten? It's a good question to ask yourself as you're writing 
code. In this case, let's leave it as it is. However, in addition to doing a lot, it also does it 
in a few new ways, which we'll look at now. 


9.2.4 Logical Operators and the Range Operator 


You're using a foreach loop to process the lines of the bionet file stored in the 
@rebasefile array. 


Within that loop you use a new feature of Perl to skip the header lines, called the range 
operator (..), which is used in this line: 


( 1 .. /Rich Roberts/ ) and next; 


This has the effect of skipping everything from the first line up to and including the line 
with "Rich Roberts," in other words, the header lines. (Range operators must have at least 
one of their endpoints given as a number to work like this.) 
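
The numeric endpoint of such a range is compared with the current input line number, which Perl keeps in the special variable $. while lines are read from a filehandle. The following minimal sketch therefore reads its lines straight from a filehandle; the in-memory file and its contents are assumptions made up for this illustration.

```perl
use strict;
use warnings;

# Made-up file contents: two header lines, a blank line, then two data lines
my $text = "REBASE header\nRich Roberts\n\nAaaI (XmaIII) C^GGCCG\nAceI G^CWGC\n";

# Open the string as an in-memory file so that $. counts the lines as they are read
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

my @kept;
while (my $line = <$fh>) {

    # True from input line 1 up to and including the line matching /Rich Roberts/
    next if 1 .. ($line =~ /Rich Roberts/);

    # Skip blank lines
    next if $line =~ /^\s*$/;

    chomp $line;
    push @kept, $line;
}
close $fh;

print "$_\n" foreach @kept;
```

Only the two data lines survive: the range covers input lines 1 and 2, and the blank line is discarded by the second test.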


The and operator is a logical operator. Logical operators are available in most 
programming languages. In Perl they've become very popular, so although we haven't 
used them a great deal in this book, you'll often come across code that does. In fact, you'll 
start to see them a bit more as the book continues. 


Logical operators can test if two conditions are both true, for instance: 

if( $string eq 'kinase' and $num == 3) { 
} 

Only if both the conditions are true is the entire statement true. 


Similarly, with logical operators you can test if at least one of the conditions is true 
using the or operator, for instance: 


if( $string eq 'kinase' or $num == 3) { 
} 
Here, the if statement is true if either or both of the conditionals are true. 


There is also the not logical operator, a negation operator with which you can test if 
something is false: 


if( not 6 == 9) { 
} 

6 == 9 returns false, which is negated by the not operator, so the entire conditional 
returns true. 


There are also the closely related operators, && for and, || for or, and ! for not. These 
have slightly different behavior (actually, different precedence); most Perl code uses the 
versions I've shown, but both are common. 


When in doubt about precedence, you can always parenthesize expressions to ensure your 
statement means what you intend it to mean. (See Section 9.3.1 later in this chapter.) 
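
To make the difference in precedence concrete, here is a minimal sketch; the variable names are assumptions chosen for this illustration. The || operator binds more tightly than assignment, while or binds more loosely, so the two statements below do different things.

```perl
use strict;
use warnings;

my $zero = 0;
my @notes;

# || binds more tightly than =, so this means: $tight = ($zero || 'default')
my $tight = $zero || 'default';

# or binds more loosely than =, so this means: ($loose = $zero) or push(...)
# The assignment happens first; the right side runs because the assigned value is false
my $loose;
$loose = $zero or push @notes, 'right side of or was evaluated';

print "tight: $tight\n";
print "loose: $loose\n";
print "notes: @notes\n";
```

Here $tight ends up holding the string 'default', whereas $loose holds 0. Writing the parentheses out explicitly removes any doubt about which meaning is intended.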


Logical operators also have an order of evaluation, which makes them useful for 
controlling the flow of programs. Let's take a look at how the and operator evaluates its 
two arguments. It first evaluates the left argument, and if it's true, evaluates and returns 
the right. If the left argument evaluates to false, the right argument is never touched. 
So the and operator can act like a mini if statement. For instance, the following two 
examples are equivalent: 


if( $verbose ) { 
    print $helpful_but_verbose_message; 
} 

$verbose and print $helpful_but_verbose_message; 

Of course, the if statement is more flexible, because it allows you to easily add more 
statements to the block, and elsif and else conditions to their own blocks. But for simple 
situations, the and operator works well.[1] 

[1] You can even chain logical operators one after the other to build up more complicated 
expressions and use parentheses to group them. Personally, I don't like that style much, but 
in Perl, there's more than one way to do it! 


The logical operator or evaluates and returns the left argument if it's true; if the left 
argument doesn't evaluate to true, the or operator then evaluates and returns the right 
argument. So here's another way to write a one-line statement that you'll often see in Perl 
programs: 

open(MYFILE, $file) or die "I cannot open file $file: $!"; 

This is basically equivalent to our frequent: 

unless (open(MYFILE, $file)) { 
    print "I cannot open file $file\n"; 
    exit; 
} 
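
The order of evaluation can be observed directly. In this minimal sketch, check is a hypothetical helper written for this illustration: it records that it was called and then returns the truth value it was given.

```perl
use strict;
use warnings;

my @evaluated;

# Hypothetical helper: remembers that it was called, then returns the given value
sub check {
    my ($label, $value) = @_;
    push @evaluated, $label;
    return $value;
}

check('A', 0) and check('B', 1);    # A is false, so B is never evaluated
check('C', 1) and check('D', 1);    # C is true, so D is evaluated
check('E', 1) or  check('F', 1);    # E is true, so F is never evaluated
check('G', 0) or  check('H', 1);    # G is false, so H is evaluated

print "@evaluated\n";
```

The labels B and F never appear in the output, because their side of the operator was never touched.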
Let's go back and take a look at the parseREBASE subroutine with the line: 

( 1 .. /Rich Roberts/ ) and next; 

The left argument is the range 1 .. /Rich Roberts/. When you're in that range of 
lines, the range operator returns a true value. Because it's true, the and boolean 
operator goes on to evaluate the other side and finds the next function, 
which takes you back to the "next" iteration of the 
enclosing foreach loop. So if you're between the first line and the Rich Roberts line, 
you skip the rest of the loop. 

Similarly, the line: 

/^\s*$/ and next; 

takes you back to the next iteration of the foreach if the left argument, which matches 
a blank line, is true. 


The other parts of this parseREBASE subroutine have already been discussed, during the 
design phase. 


9.2.5 Finding the Restriction Sites 


So now it's time to write a main program and see our code in action. Let's start with a 
little pseudocode to see what still needs to be done: 


#
# Get DNA
#
get_file_data

extract_sequence_from_fasta_data

#
# Get the REBASE data into a hash, from file "bionet"
#
parseREBASE('bionet');

for each user query

    If query is defined in the hash
        Get positions of query in DNA

    Report on positions, if any

}

You now need to write a subroutine that finds the positions of the query in the DNA. 
Remember that trick of putting a global search in a while loop from Example 5-7 
and take heart. No sooner said than: 

Given arguments $query and $dna

while ( $dna =~ /$query/ig ) {
    save the position of the match
}

return @positions

When you used this trick before, you just counted how many matches there were, not 
what the positions were. Let's check the documentation for clues, specifically the list of 
built-in functions in the documentation. It looks like the pos function will solve the 
problem. It gives the location of the last match of a variable in an m//g search. 
Example 9-3 shows the main program followed by the required subroutine. It's a 
simple subroutine, given the Perl functions like pos that make it easy. 


Example 9-3. Make restriction map from user queries 


#!/usr/bin/perl
# Make restriction map from user queries on names of restriction enzymes

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Declare and initialize variables
my %rebase_hash = (  );
my @file_data = (  );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = (  );

# Read in the file "sample.dna"
@file_data = get_file_data("sample.dna");

# Extract the DNA sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);

# Get the REBASE data into a hash, from file "bionet"
%rebase_hash = parseREBASE('bionet');

# Prompt user for restriction enzyme names, create restriction map
do {
    print "Search for what restriction enzyme (or quit)?: ";

    $query = <STDIN>;

    chomp $query;

    # Exit if empty query
    if ($query =~ /^\s*$/ ) {

        exit;
    }

    # Perform the search in the DNA sequence
    if ( exists $rebase_hash{$query} ) {

        ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});

        # Create the restriction map
        @locations = match_positions($regexp, $dna);

        # Report the restriction map to the user
        if (@locations) {
            print "Searching for $query $recognition_site $regexp\n";
            print "A restriction site for $query at locations:\n";
            print join(" ", @locations), "\n";
        } else {
            print "A restriction site for $query is not in the DNA:\n";
        }
    }
    print "\n";
} until ( $query =~ /quit/ );

exit;

################################################################################
# Subroutine
#
# Find locations of a match of a regular expression in a string
#
#
# return an array of positions where the regular expression
#  appears in the string
#

sub match_positions {

    my($regexp, $sequence) = @_;

    use strict;

    use BeginPerlBioinfo;     # see Chapter 6 about this module

    #
    # Declare variables
    #

    my @positions = (  );

    #
    # Determine positions of regular expression matches
    #

    while ( $sequence =~ /$regexp/ig ) {

        push ( @positions, pos($sequence) - length($&) + 1);
    }

    return @positions;
}

Here is some sample output from Example 9-3: 

Search for what restriction enzyme (or quit)?: AceI 
Searching for AceI G^CWGC GC[AT]GC 
A restriction site for AceI at locations: 
54 94 582 660 696 702 840 855 957 

Search for what restriction enzyme (or quit)?: AccII 
Searching for AccII CG^CG CGCG 
A restriction site for AccII at locations: 
ei 








Search for what restriction enzyme (or quit)?: AaeI 
A restriction site for AaeI is not in the DNA: 


Search for what restriction enzyme (or quit)?: quit 

Notice the length($&) in the subroutine match_positions. That $& is a special 
variable that's set after a successful regular-expression match. It stands for the sequence 
that matched the regular expression. Since pos gives the position of the first base 
following the match, you have to subtract the length of the matching sequence and then add 
one (to make the bases start at position 1 instead of position 0) to report the starting 
position of the match. Other special variables include $`, which contains everything in 
the string before the successful match; and $', which contains everything in the string 
after the successful match. So, for example: '123456' =~ /34/ succeeds at setting 
these special variables like so: $` = '12', $& = '34', and $' = '56'. 
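
The arithmetic on pos can be checked on a toy sequence. In this minimal sketch, find_starts is a hypothetical stand-in for match_positions that doesn't need the BeginPerlBioinfo module, and the short DNA string is made up for the illustration.

```perl
use strict;
use warnings;

# The special variables after a successful match
my $digits = '123456';
my ($before, $matched, $after);
if ($digits =~ /34/) {
    ($before, $matched, $after) = ($`, $&, $');
}
print "before=$before matched=$matched after=$after\n";

# Hypothetical stand-in for match_positions
sub find_starts {
    my ($regexp, $sequence) = @_;
    my @positions = ();

    while ($sequence =~ /$regexp/ig) {
        # pos() is the offset just past the match; step back over the match
        # and add 1 so that the first base of the sequence is position 1
        push @positions, pos($sequence) - length($&) + 1;
    }
    return @positions;
}

# Made-up sequence: GCAGC starts at base 3, GCTGC starts at base 10
my @starts = find_starts('GC[AT]GC', 'AAGCAGCTTGCTGCAA');
print "@starts\n";
```

For the first match, pos returns 7 and the match is 5 bases long, so the reported start is 7 - 5 + 1 = 3.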


What we have here is admittedly bare bones, but it does work. See the exercises at the 
end of the chapter for ways to extend this code. 


9.3 Perl Operations 


We've made it pretty far in this introductory programming book without talking about 
basic arithmetic operations, because you haven't really needed much more than addition 
to increment counters. 


However, an important part of any programming language, Perl included, is the ability to 
do mathematical calculations. Look at Appendix B, which shows the basic operations 
available in Perl. 


9.3.1 Precedence of Operations and Parentheses 


Operations have rules of precedence. These enable the language to decide which 
operations should be done first when there are a few of them in a row. The order of 
operations can change the result, as the following example demonstrates. 

Say you have the code 8 + 4 / 2. If you did the division first, you'd get 8 + 2, or 
10. However, if you did the addition first, you'd get 12 / 2, or 6. 

Now programming languages assign precedences to operations. If you know these, you 
can write expressions such as 8 + 4 / 2, and you'd know what to expect. But this is a 
slippery slope. 

For one thing, what if you get it wrong? Or, what if someone else has to read the code 
who doesn't have the memorization powers you do? Or, what if you memorize it for one 
language and Perl does it differently? (Different languages do indeed have different 
precedence rules.) 

There is a solution, and it's called using parentheses. For example, if you 
simply add parentheses: (8 + ( 4 / 2 )), it's clear to you, other readers, and the Perl 
program, that you want to do the division first. Note that "inner" parentheses, contained 
within another pair of parentheses, are evaluated first. 


Remember to use parentheses in complicated expressions to specify the order of 
operations. Among other things, it will save you some long debugging sessions! 
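
Here is a minimal sketch of the 8 + 4 / 2 example in Perl; the variable names are assumptions chosen for this illustration.

```perl
use strict;
use warnings;

my $division_first = 8 + 4 / 2;      # Perl divides before it adds
my $same_thing     = 8 + (4 / 2);    # the parentheses only make that explicit
my $addition_first = (8 + 4) / 2;    # the parentheses force the addition first

print "$division_first $same_thing $addition_first\n";
```

The first two expressions give 10 and the third gives 6, matching the two readings worked out by hand above.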


9.4 Exercises 


Exercise 9.1 


Modify Example 9-3 to accept DNA from the command line; if it's not 
specified there, prompt the user for a FASTA filename and read in the DNA 
sequence data. 


Exercise 9.2 


Modify Exercise 9.1 to read in, and make a hash of, the entire REBASE 
restriction site data from the bionet file. 


Exercise 9.3 


Modify Exercise 9.2 to store the REBASE hash created in a DBM file if it doesn't 
exist or to use the DBM file if it does exist. (Look ahead to Chapter 10 for 
more information about DBM.) 


Exercise 9.4 


Modify Example 5-3 to report on the locations of the motifs that it finds, even 
if a motif appears multiple times in the sequence data. 


Exercise 9.5 


Include a graphic display of the cut sites in the restriction map by printing the 
sequence and labeling the recognition sites with the enzyme name. Can you make 
a map that handles multiple restriction enzymes? How can you handle 
overlapping restriction sites? 


Exercise 9.6 


Write a subroutine that returns a restriction digest, the fragments of DNA left after 
performing a restriction reaction. Remember to take into account the location of 
the cut site. (This requires you to parse the REBASE bionet file in a different 
manner. You may, if you wish, ignore restriction enzymes that are not given with 
a ^ indicating a cut site.) 


Exercise 9.7 


Extend the restriction map software to take into account the opposite strand for 
nonpalindromic recognition sites. 


Exercise 9.8 


Given an arithmetic expression without parentheses, write a subroutine that adds 
the appropriate parentheses to conform to Perl's precedence rules. (Warning: this 
is a pretty hard exercise and should be skipped by all but the true believers who 
have extra time on their hands. See the Perl documentation for the precedence 
rules.) 
Chapter 10. GenBank 


GenBank (Genetic Sequence Data Bank) is a rapidly growing international repository of 
known genetic sequences from a variety of organisms. Its use is central to modern 
biology and to bioinformatics. 


This chapter shows you how to write Perl programs to extract information from GenBank 
files and libraries. Exercises include looking for patterns; creating special libraries; and 
parsing the flat-file format to extract the DNA, annotation, and features. You will learn 
how to make a DBM database to create your own rapid-access lookups on selected data 
in a GenBank library. 


Perl is a great tool for dealing with GenBank files. It enables you to extract and use any 
of the detailed data in the sequence and in the annotation, such as in the FEATURES 
table and elsewhere. When I first started using Perl, I wrote a program that searched 
GenBank for all sequence records annotated as being located on human chromosome 22. 
I found many genes where that information was so deeply buried within the annotation, 
that the major gene mapping database, Genome Database (GDB), hadn't included them in 
their chromosome map. I think you'll discover the same feeling of power over the 
information when you start applying Perl to GenBank files. 


Most biologists are familiar with GenBank. Researchers can perform a search, e.g., a 
BLAST search on some query sequence, and collect a set of GenBank files of related 
sequences as a result. Because the GenBank records are maintained by the individual 
scientists who discovered the sequences, if you find some new sequence of interest, you 
can publish it in GenBank. 


GenBank files have a great deal of information in them in addition to sequence data, 
including identifiers such as accession numbers and gene names, phylogenetic 
classification, and references to published literature. A GenBank file may also include a 
detailed FEATURES table that summarizes facts about the sequence, such as the location 
of the regulatory regions, the protein translation, and exons and introns. 


GenBank is sometimes referred to as a databank or data store, which is different from a 
database. Databases typically have a relational structure imposed upon the data, 
including associated indices and links and a query language. GenBank in comparison is a 
flat file, that is, an ASCII text file that is easily readable by humans.[1] 


[1] GenBank is also distributed in ASN.1 format, for which you need specialized tools, provided 
by NCBI. 


From its humble beginnings GenBank has rapidly grown, and the flat-file format has seen 
signs of strain during the growth. With a quickly advancing body of knowledge, 
especially one that's growing as quickly as genetic data, it's difficult for the design of a 
databank to keep up. Several reworkings of GenBank have been done, but the flat-file 
format—in all its frustrating glory—still remains. 
Due to a certain flexibility in the content of some sections of a GenBank record, 
extracting the information you're looking for can be tricky. This flexibility is good, in that 
it allows you to put what you think is most important into the data's annotation. It's bad, 
because that same flexibility makes it harder to write programs that find and extract the 
desired annotations. As a result, the trend has been towards more structure in the 
annotations. 


Since Perl's data structures and its use of regular expressions make it a good tool for 
manipulating flat files, Perl is especially well-suited to deal with GenBank data. Using 
these features in Perl and building on the skills you've developed from previous chapters, 
you can write programs to access the accumulated genetic knowledge of the scientific 
community in GenBank. 


Since this is a beginning book that requires no programming experience, you should not 
expect to find the most finished, multipurpose software here. Instead you'll find a solid 
introduction to parsing and building fast lookup tables for GenBank files. If you've never 
done so, I strongly recommend you explore the National Center for Biotechnology 
Information (NCBI) at the National Institutes of Health (NIH) 
(http://www.ncbi.nlm.nih.gov). While you're at it, stop by the European 
Bioinformatics Institute (EBI) at http://www.ebi.ac.uk and the bioinformatics arm 
of the European Molecular Biology Laboratory (EMBL) at 
http://www.embl-heidelberg.de/. These are large, heavily funded governmental bioinformatics 
powerhouses, and they have (and distribute) a great deal of state-of-the-art bioinformatics 
software. 


10.1 GenBank Files 


The primary repositories for genetic information are the NCBI GenBank, EMBL in 
Europe, and the DNA Data Bank of Japan (DDBJ). All have almost identical information 
due to international cooperative agreements. Each entry or record in GenBank or its 
mirror sites may contain identifying, descriptive, and genetic information in ASCII- 
format files. Each record is written in a specific standard format, organized so that both 
humans and computer programs can extract the desired information with reasonable ease. 


Let's look at a relatively short GenBank record and at how the fields are defined, before 
writing any code. I'll save this information in a file called record.gb, for use in later 
programs. 

LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp,
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
/translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD 











NCNEWFHGDCIRITEKMAKATREWYCRECREKDPKLETRYRHKKSRERDGNERDSSEP 








RDEGGGRKRPVPDPDLORRAGSGTGVGAMLARGSAS PHKSS POPLVAT PSQHHOQOOO 
QIKRSARMCGECEACRRTEDCGHCDF'CRDMKKFGGPNKIROKCRLROCOQLRARESYKY 
FPSSLSPVTPSESLPRPRRPLPTQOOQPOPSOKLGRIREDEGAVASSTVKEPPEATATP 
EPLSDEDLPLDPDLYQDFCAGAF DDHGLPWMSDTEES PFLDPALRKRAVKVKHVKRRE 
KKSEKKKEERYKRHROKOKHKDKWKH PERADAKDPASLPOCLGPGCVRPAQPSSKYCS 
DDCGMKLAANRIYETLPORTQOWQOS PC IABREHGKKLLERIRREQOSARTRLOQEMERR 
FHELEAT TLRAKOQAVREDEESNEGDSDDTDLOTFCVSCGHPINPRVALRHMERCYAK 
YESOTSFGSMY PTRIEGATRLFCDVYNPOSKTYCKRLOVLC PEHSRDPKVPADEVCGC 


PLVRDVFELTGDFCRLPKROCNRHY CWEKLRRAEVDLERVRVWYKLDELFEOQERNVRT 
AMTNRAGLLALMLHOTIQHDPLTTDLRSSADR" 

BASE COUNT      564 a    715 c    768 g    440 t
ORIGIN
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
      181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat

241 ggttcagacc cagagcctcc agatgccggg gaggacagca 
agtccgagaa tggggagaat 

301 gcgeccatct actgcatctg ccgcaaaccg gacatcaact 
GCLicatGat cggqgiglgac 

361 aactgcaatg agtggttcca tggggactgc atccggatca 
ctgagaagat ggccaaggcc 

421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc 
ccaagctaga gattcgctat 
481 cggcacaaga agtcacggga 
acagcagtga gccccgggat 

S41 gagggtggag ggcgcaagag 
tgcagcgccg ggcagggtca 

601 gggacagggg ttggggccat 
COeCCeCACad: aLeCECECccG 

661 cagcccttigqg 
agcagcagca gatcaaacgg 








tggccacacc 


gcgggatgge 
GeCLQECCCL 
GCLrUgCLeegG, 


cagccagcat 








JZ. teagcccgca 
ctgaggactg tggtcactgt 

781 gatttctgtc gggacatgaa 
agatccggca gaagtgccgg 

841 ctgcgccagt gccagctgcg 
ackUcCCCLEC CliLCgCUCtCa 

901 ccagtgacgc cctcagagtc 
cactgeccac ccaacagcag 

961 ccacagccat cacagaagtt 
agauguceds ggcgtcatca 
1021 acagtcaagg agcctcctga 
cactctcaga tgaggaccta 
1081 cctctggatc ctgacctgta 
ectttgatga ecatggccrg 
1141 ccctggatga gcgacacaga 
cegegctgcg gaagagggca 
1201 gtgaaagtga agcatgtgaa 
agaagaagaa ggaggagcga 
1261 tacaagcggc atcggcagaa 
ggaaacaccc agagagggct 
1321 gatgecaagg accctgcgtc 
Coguregtds Teg cceeges 


tgtgtggtga 
































gtgtgaggca 
gaagttcggg 


gGgGCCCOIGGGaa 


cctgccaagg 
agggcgcatc 
ggcetacagcec 
tcaggacttc 
agagrcceca 
gcgtcgggag 
gcagaagcac 


acegecccag 


aatgagcggg 





gatccagacc 
GICLCLOCUL 
caccagcagc 
totcggcgca 
ggccccaaca 
tcgtacaagt 
eccogccgqc 
cgtgaagatg 
acacctgagc 
tgtgcagggg 
kiectggacc 
aagaagtctg 
aaggataaat 


EGeceLoggGC 





1381 cagceccagct ccaagtattg 
agctggcage caaccgcatc 
1441 tacgagatce tcccccagcg 
gceccttgcat tgctgaagag 
1501 cacggcaaga agctgctcga 
agegegeee” cackcgectet 
1561 caggaaatgg aacgccgatt 
Eeceecilde Caagcagcag 
1621 gctgtgcgcg aggatgagga 
atgacacaga cctgcagatc 
1631 LECEGUGELE. cctgtgqggca 
ecttgcegcca catggagcgc 
1741 tgctacgcca agtatgagag 
tgtaccccac acgcattgaa 
1801 ggggccacac gactcttctg 
gcaaaacata ctgtaagcgg 
ctcagatgac 
catccagcag 
acgcattcgc 
Cccatgagctt 
gagcaacgag 
CCCCatcaac 
ccagacgtcc 


tgatgtgtat 


tgtggcatga 


te 


tggcagcaga 





cgagagcagc 
gaggccatca 
ggtgacagtg 
ccacgtgttg 
tLuggqgucea 


aatcctcaga 
1861 ctccaggtgc tgtgccccga gcactcacgg gaccccaaag 
tgccagctga cgaggtatgc 
1921 gggtgcccce ttgtacgtga tgtctttgag ctcacgggtg 
actectgecd. ectgecccadg 
1981 cgccagtgqca atcgqecatta ctgc 
gtgcggaagt ggacttggag 

2041 cgcgtgcgtg tgtggtacaa gctggacgag ctgtttgagc 
aggagcgcaa tgtgcgcaca 

2101 gccatgacaa accgcgcggg attgctggcc ctgatgctgc 
accagacgat ccagcacgat 

2161 secectcacta cegacctrgcg cruccaqtgqcce cqaccactgag 
CCLCCLOGEG ‘caogacccercr 

2221 acaccctgca ttccagatgg ggqgagcecgcce cggtgceccgt 
gtgtccgttc ctccactcat 

2201. CUOEEECECE GgGLTLCLCCCE gigeccatcce accogrtigac 
cgqcecatctg cctttatcag 

2341. aggqgactate ccecgtcgaca tqtticagtge crogtgqggc 
tgcggagtcc act 


























fe 


tgggag aagctgcgge 





















































ECaLecur 

2401 gectcctctc cctgggtttt gttaataaaa ttttgaagaa 
accaaaaaaa aaaaaaaaaa 

2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa 
//


Even if you're used to seeing GenBank files, it's worth taking the time to look one over, 
while considering how you would write a program to extract various parts of the data. For 
instance, how would you extract the sequence data? What's the format of the FEATURES 
table and its various subfields? 


There's a lot of information packed into a typical GenBank entry, and it's important to be 
able to separate the different parts. For instance, if you can extract the sequence, you can 
search for motifs, calculate statistics on the sequence, look for similarity with other 
sequences, and so forth. Similarly, you'll want to separate out—or parse—the various 
parts of the data annotation. In GenBank, this includes ID numbers, gene names, genus 
and species, publications, etc. The FEATURES table part of the annotation can include 
specific information about the DNA, such as the locations of exons, regulatory regions, 
important mutations, and so on. 


The format specification of GenBank files and a great deal of other information about 
GenBank can be found in the GenBank release notes, gbrel.txt, on the GenBank web 
site at ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt. 

gbrel.txt gives complete detail about the structure of GenBank files to help 
programmers, so you may want to refer to it as your searches become more complex. As 
a Perl programmer, you won't need all of the detail because you can parse data using 
regular expressions or the split function. You need to get the data out and make it 
available to your programs. The code that accomplishes this task can be fairly simple, as 
you will see in this chapter. 
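
As a first taste of both techniques, here is a minimal sketch that pulls a few values out of the opening lines of the record shown earlier. It's only an illustration under simple assumptions (the variable names are made up, and only four header lines are used); it isn't one of this chapter's numbered examples.

```perl
use strict;
use warnings;

# Opening lines of the record shown earlier, held in a single scalar
my $record = <<'END';
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
END

# A regular expression picks out the value that follows a keyword;
# the m modifier lets ^ match at the start of any line in the scalar
my ($accession) = $record =~ /^ACCESSION\s+(\S+)/m;

# split breaks the LOCUS line into whitespace-separated fields
my ($locus_line) = $record =~ /^(LOCUS.*)$/m;
my @fields = split " ", $locus_line;
my ($name, $length) = @fields[1, 2];

print "accession=$accession name=$name length=$length\n";
```

The regular expression suits fields that are introduced by a keyword, while split is handy when a line is just a row of whitespace-separated values.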




10.2 GenBank Libraries 


GenBank is distributed as a set of libraries: flat files containing many records in 
succession.[2] As of GenBank release 125.0, August 2001, there are 243 files, most of 
which are over 200 MB in size. Altogether, GenBank contains 12,813,516 loci and 
13,543,364,296 bases from 12,813,516 reported sequences. The libraries are usually 
distributed compressed, which means you can download somewhat smaller files, but you 
need to uncompress them after you receive them. Uncompressed, this amounts to about 
50 GB of data. Since 1982, the number of sequences in GenBank has doubled about 
every 14 months. 

[2] The data is also distributed in the ASN.1 format. 

GenBank libraries are further organized into divisions by the classification of the 
sequences they contain, either phylogenetically or by sequencing technology. Here are 
the divisions: 
PRI: primate sequences 
ROD: rodent sequences 
MAM: other mammalian sequences 
VRT: other vertebrate sequences 
INV: invertebrate sequences 
PLN: plant, fungal, and algal sequences 
BCT: bacterial sequences 
VRL: viral sequences 
PHG: bacteriophage sequences 
SYN: synthetic and chimeric sequences 
UNA: unannotated sequences 
EST: EST sequences (expressed sequence tags) 
PAT: patent sequences 
STS: STS sequences (sequence tagged sites) 
GSS: GSS sequences (genome survey sequences) 
HTG: HTGS sequences (high throughput genomic sequencing data) 
HTC: HTC sequences (high throughput cDNA sequencing data) 


Some divisions are very large: the largest, the EST, or expressed sequence tag division, is 
comprised of 123 library files! A portion of human DNA is stored in the PRI division, 
which contains (as of this writing) 13 library files, for a total of almost 3.5 GB of data. 
Human data is also stored in the STS, GSS, HTGS, and HTC divisions. Human data 
alone in GenBank makes up almost 5 million record entries with over 8 trillion bases of 
sequence. 


The public database servers such as Entrez or BLAST at 
http://www.ncbi.nlm.nih.gov/ give you access to well-maintained and updated 
sequence data and programs, but many researchers find that they need to write their own 
programs to manipulate and analyze the data. The problem is, there's so much data. For 
many purposes, you can download a selected set of records from NCBI or other locations, 
but sometimes you need the whole dataset. 


It's possible to set up a desktop workstation (Windows, Mac, Unix, or Linux) that 
contains all of GenBank; just be sure to buy a very large hard disk! Getting all that data 
onto your hard drive, however, is more difficult. A Perl program called mirror.pl helps to 
address this need. Downloading it, even with a university-standard, high-speed Internet 
connection can be time-consuming; downloading an entire dataset with a modem can be 
an exercise in frustration. The best solution is to download only the files you need, in 
compressed form. The EST data, for example, is about half the entire database; don't 
download it unless you really need to. If you need to download GenBank, I recommend 
contacting the help desk at NCBI. They'll help you get the most up-to-date information. 


Since you're learning to program, it makes more sense to practice on a tiny, five-record 
library file, but the programs you'll write will work just fine on the real files. 
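
Because every record in a library ends with the end-of-record line //, one possible way to walk through a library a record at a time is to set Perl's input record separator, $/, to that line. The following is a minimal sketch under stated assumptions: the two drastically shortened records are made up, and an in-memory file stands in for a real library file.

```perl
use strict;
use warnings;

# Two made-up, drastically shortened records standing in for a library file
my $library = <<'END';
LOCUS       TEST0001
ORIGIN
        1 acgtacgtac
//
LOCUS       TEST0002
ORIGIN
        1 ttgacctgaa
//
END

open(my $fh, '<', \$library) or die "Cannot open in-memory library: $!";

my @names;
{
    # Read up to and including each end-of-record line instead of each newline
    local $/ = "//\n";

    while (my $record = <$fh>) {
        # Each $record now holds one complete record in a single scalar
        my ($name) = $record =~ /^LOCUS\s+(\S+)/m;
        push @names, $name;
    }
}
close $fh;

print scalar(@names), " records: @names\n";
```

Reading this way keeps only one record in memory at a time, which matters when a library file is hundreds of megabytes long.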


10.3 Separating Sequence and Annotation 


In previous chapters you saw how to examine the lines of a file using Perl's array 
operations. Usually, you do this by saving the data in an array with each line appearing as an 
element of the array. 


Let's look at two methods to extract the annotation and the DNA from a GenBank file. In 
the first method, you'll slurp the file into an array and look through the lines, as in 
previous programs. In the second, you'll put the whole GenBank record into a scalar 
variable and use regular expressions to parse the information. Is one approach better than 
the other? Not necessarily: it depends on the data. There are advantages and 
disadvantages to each, but both get the job done. 


I've put five GenBank records in a file called library.gb. As before, you can download 
the file from this book's web site. You'll use this datafile and the file record.gb in the 
next few examples. 


10.3.1 Using Arrays 
Example 10-1 shows the first method, which operates on an array containing the lines 
of the GenBank record. The main program is followed by a subroutine that does the real 
work. 


Example 10-1. Extract annotation and sequence from GenBank file 


#!/usr/bin/perl
# Extract annotation and sequence from GenBank file

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# declare and initialize variables
my @annotation = (  );
my $sequence = '';
my $filename = 'record.gb';

parse1(\@annotation, \$sequence, $filename);

# Print the annotation, and then
#   print the DNA in new format just to check if we got it okay.
print @annotation;

print_sequence($sequence, 50);

exit;

################################################################################
# Subroutine
################################################################################

# parse1
#
# --parse annotation and sequence from GenBank record

sub parse1 {

    my($annotation, $dna, $filename) = @_;

    # $annotation--reference to array
    # $dna       --reference to scalar
    # $filename  --scalar

    # declare and initialize variables
    my $in_sequence = 0;
    my @GenBankFile = (  );

    # Get the GenBank data into an array from a file
    @GenBankFile = get_file_data($filename);

    # Extract all the sequence lines
    foreach my $line (@GenBankFile) {

        if( $line =~ /^\/\/\n/ ) {          # If $line is end-of-record line //\n,
            last;                           # break out of the foreach loop.
        } elsif( $in_sequence) {            # If we know we're in a sequence,
            $$dna .= $line;                 # add the current line to $$dna.
        } elsif ( $line =~ /^ORIGIN/ ) {    # If $line begins a sequence,
            $in_sequence = 1;               # set the $in_sequence flag.
        } else{                             # Otherwise
            push( @$annotation, $line);     # add the current line to @annotation.
        }
    }

    # remove whitespace and line numbers from DNA sequence
    $$dna =~ s/[\s0-9]//g;
}

Here's the beginning and end of Example 10-1's output of the sequence data: 
agalrggcggcgctgaqggqqticttoggggctictaggecggccacctacigg 
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct 
gggaggcgtgactagaagcggaagtagttgtgggcgcectttgcaaccgcc 
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt 
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc 
... 
CAGTICCCOUGLGLECOLTECCUCCacCTCatcrgliicticegqrticicect 
gtgcccatccaccggttgaccgcccatctgcctttatcagagggactgtc 
eccgtcgqacatgtticagtgcctggtogggceLtocggagtuccactcatcctL 
gcctcctctccctgggttttgttaataaaattttgaagaaaccaaaaaaa 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 

The foreach loop in subroutine parse1 in Example 10-1 moves one by one 
through the lines from the GenBank file stored in the array @GenBankFile. It takes 
advantage of the structure of a GenBank file, which begins with annotation and runs until 
the line: 

ORIGIN 

is found, after which sequence appears until the end-of-record line // is reached. The 
loop uses a flag variable $in_sequence to remember that it has found the ORIGIN 
line and is now reading sequence lines. 


The foreach loop has a new feature: the Perl built-in function last, which breaks out 
of the nearest enclosing loop. It's triggered by the end-of-record line //, which is reached 
when the entire record has been seen. 


A regular expression is used to find the end-of-record line. To correctly match the 
end-of-record (forward) slashes, you must escape them by putting a backslash in front of 
each one, so that Perl doesn't interpret them as prematurely ending the pattern. The 
regular expression is anchored to the start of the line with ^ and ends with a newline, 
^\/\/\n, which is then placed inside the usual delimiters: /^\/\/\n/. (When you have a lot 
of forward slashes in a regular expression, you can use another delimiter around the 
regular expression and precede it with an m, thus avoiding having to backslash the 
forward slashes. It's done like so: m!^//\n!). 
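
To see that the two spellings behave the same way, here is a minimal sketch. The three sample lines are made up for illustration and are not taken from record.gb; the variable names are likewise only assumptions of this sketch.

```perl
use strict;
use warnings;

# Hypothetical sample lines, not taken from record.gb
my @lines = ("ORIGIN\n", "        1 agatggcggc\n", "//\n");

# Count the end-of-record lines with each spelling of the pattern.
# grep in scalar context returns the number of matching elements.
my $escaped   = grep { /^\/\/\n/ } @lines;   # slashes escaped with backslashes
my $alternate = grep { m!^//\n!  } @lines;   # m!...! delimiters, nothing to escape

print "escaped: $escaped, alternate: $alternate\n";
```

Both counts come out the same, so the choice between the two forms is purely one of readability.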


An interesting point about subroutine parse1 is the order of the tests in the foreach 
loop that goes through the lines of the GenBank record. As you read through the lines of 
the record, you want to first gather the annotation lines, set a flag when the ORIGIN 
start-of-sequence line is found, and then collect the lines until the end-of-record // line is 
found. 


Notice that the order of the tests is exactly the opposite. First, you test for the 
end-of-record line, collect the sequence if the $in_sequence flag is set, and then test for 
the start-of-sequence ORIGIN line. Finally, you collect the annotation. 


Reading lines one by one and using flag variables to mark which section of the file 
you're in is a common programming technique. So, take a moment to think about how 
the loop would behave if you changed the order of the tests. If you collected sequence 
lines before testing for the end-of-record, you'd never get to the end-of-record test! 
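
The effect of the wrong order can be tried out on a small scale. The following is only a sketch: the two miniature records are invented, and the variable $wrong is a name chosen for this illustration. The tests are deliberately placed in the wrong order, with the sequence test first.

```perl
use strict;
use warnings;

# Hypothetical input: the end of one made-up record and the start of another
my @lines = (
    "LOCUS       TESTA\n",
    "ORIGIN\n",
    "        1 acgt\n",
    "//\n",
    "LOCUS       TESTB\n",
);

# Tests in the WRONG order: the sequence test comes before the end-of-record test
my $in_sequence = 0;
my $wrong       = '';
foreach my $line (@lines) {
    if ($in_sequence) {                 # always true once ORIGIN has been seen,
        $wrong .= $line;                # so the // test below is never reached
    } elsif ($line =~ /^\/\/\n/) {
        last;
    } elsif ($line =~ /^ORIGIN/) {
        $in_sequence = 1;
    }
}
$wrong =~ s/[\s0-9]//g;

print "$wrong\n";
```

The collected "sequence" now contains the // separator and the LOCUS line of the following record, which is exactly the failure the order of tests in Example 10-1 is designed to avoid.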


Other methods of collecting annotation and sequence lines are possible, especially if you 
go through the lines of the array more than once. You can scan through the array, keeping 
track of the start-of-sequence and end-of-record line numbers, and then go back and 
extract the annotation and sequence using an array splice (which was described in the 
parseREBASE subroutine in Example 9-2). Here's an example: 

# find line numbers of ORIGIN and // in the GenBank record 
$linenumber = 0; 
foreach my $line (@GenBankFile) { 
    if ( $line =~ /^\/\/\n/ ) {       # end-of-record // line 
        $end = $linenumber; 
        last; 
    } elsif ( $line =~ /^ORIGIN/ ) {  # end annotation, begin sequence 
        $origin = $linenumber; 
    } 
    $linenumber++; 
} 

# extract annotation and sequence with "array splice" 
@annotation = @GenBankFile[0..($origin-1)]; 
@sequence   = @GenBankFile[($origin+1)..($end-1)]; 



10.3.2 Using Scalars 


A second way to separate annotations from sequences in GenBank records is to read the 
entire record into a scalar variable and operate on it with regular expressions. For some 
kinds of data, this can be a more convenient way to parse the input (compared to 
scanning through an array, as in Example 10-1). 


Usually string data is stored one line per scalar variable with its newlines, if any, at the 
end of the string. Sometimes, however, you store several lines concatenated together in 
one string that is, in turn, stored in a single scalar variable. These multiline strings aren't 
uncommon; you used them to gather the sequence from a FASTA file in Example 6-2 
and Example 6-3. Regular expressions have pattern modifiers that can 
be used to make multiline strings with their embedded newlines easy to use. 


10.3.2.1 Pattern modifiers 


The pattern modifiers we've used so far are /g, for global matching, and /i, for case- 
insensitive matching. Let's take a look at two more that affect the way regular expressions 
interact with the newlines in scalars. 


Recall that previous regular expressions have used the caret (^), dot (.), and dollar sign 
($) metacharacters. The ^ anchors a regular expression to the beginning of a string, by 
default, so that /^THE BEGUINE/ matches a string that begins with "THE BEGUINE". 
Similarly, $ anchors an expression to the end of the string, and the dot (.) matches any 
character except a newline. 


The following pattern modifiers affect these three metacharacters: 


The /s modifier assumes you want to treat the whole string as a single line, even with 
embedded newlines, so it makes the dot metacharacter match any character including 
newlines. 


The /m modifier assumes you want to treat the whole string as a multiline, with 
embedded newlines, so it extends the ^ and the $ to match after, or before, a newline 
embedded in the string. 


10.3.2.2 Examples of pattern modifiers 


Here's an example of the default behavior of caret (^), dot (.), and dollar sign ($): 

use warnings; 
"AAC\nGTT" =~ /^.*$/; 
print $&, "\n"; 

This demonstrates the default behavior without the /m or /s modifiers and prints the 
warning: 

Use of uninitialized value in print statement at line 3. 





The print statement tries to print $&, a special variable that is always set to the last 
successful pattern match. This time, since the pattern doesn't match, the variable $& isn't 
set, and you get a warning message for attempting to print an uninitialized value. 


Why doesn't the match succeed? First, let's examine the ^.*$ pattern. It begins with a ^, 
which means it must match from the beginning of the string. It ends with a $, which 
means it must also match at the end of the string (the end of the string may contain a 
single newline, but no other newlines are allowed). The .* means it must match zero or 
more (*) of any characters (.) except the newline. So, in other words, the pattern ^.*$ 
matches any string that doesn't contain a newline except for a possible single newline as 
the last character. But since the string in question, "AAC\nGTT", does contain an 
embedded newline \n that isn't the last character, the pattern match fails. 


In the next examples, the pattern modifiers /m and /s change the default 
behaviors for the metacharacters ^ and $, and the dot: 


"AAC\nGTT" =~ /^.*$/m; 
print $&, "\n"; 


This snippet prints out AAC and demonstrates the /m modifier. The /m extends the 
meaning of the ^ and the $ so they also match around embedded newlines. Here, the 
pattern matches from the beginning of the string up to the first embedded newline. 


The next snippet of code demonstrates the /s modifier: 


"AAC\nGTT" =~ /^.*$/s; 
print $&, "\n"; 


which produces the output: 


AAC 
GTT 

The /s modifier changes the meaning of the dot metacharacter so that it matches any 
character including newlines. With the /s modifier, the pattern matches everything from 
the beginning of the string to the end of the string, including the newline. Notice when it 
prints, it prints the embedded newline. 


10.3.2.3 Separating annotations from sequence 


Now that you've met the pattern-matching modifiers and regular expressions that will be 
your main tools for parsing a GenBank file as a scalar, let's try separating the annotations 
from the sequence. 


The first step is to get the GenBank record stored as a scalar variable. Recall that a 
GenBank record starts with a line beginning with the word "LOCUS" and ends with the 
end-of-record separator: a line containing two forward slashes. 


Perl provides a device called the input record separator, denoted by the special 
variable $/, that lets you read a whole record at once. The input record separator is 
usually set to a newline, so each call to read a scalar from a filehandle gets one line. Set 
it to the GenBank end-of-record separator like so: 


$/ = "//\n"; 


A call to read a scalar from a filehandle takes all the data up to the GenBank end-of- 
record separator. So the line $record = <GBFILE> in Example 10-2 stores the 
multiline GenBank record into the scalar variable $record. Later, you'll see that you 
can keep repeating this call in order to read in successive GenBank records from a 
GenBank library file. 
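
As a minimal sketch of reading successive records this way, the following code reads from a string through an in-memory filehandle, so it runs without any datafile. The two miniature records are invented, and the use of local to restore $/ automatically is a choice made for this sketch rather than the approach taken in Example 10-2, which saves and restores the separator by hand.

```perl
use strict;
use warnings;

# Two made-up miniature records held in a string (hypothetical data)
my $library = "LOCUS       TESTA\nORIGIN\n        1 acgt\n//\n"
            . "LOCUS       TESTB\nORIGIN\n        1 ttga\n//\n";

# Open an in-memory filehandle on the string
open(my $fh, '<', \$library) or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "//\n";          # restored automatically at the end of this block
    while (my $record = <$fh>) {
        push @records, $record; # each read returns one whole record
    }
}
close $fh;

print scalar(@records), " records read\n";
```

Each element of @records is one complete record, including its terminating // line.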


After reading in the record, you'll parse it into the annotation and sequence parts making 
use of /s and /m pattern modifiers. Extracting the annotation and sequence is the easy 
part; parsing the annotation will occupy most of the remainder of the chapter. 


Example 10-2. Extract annotation and sequence from GenBank record 


#!/usr/bin/perl 
# Extract the annotation and sequence sections from the first 
#   record of a GenBank library 

use strict; 
use warnings; 
use BeginPerlBioinfo;     # see Chapter 6 about this module 

# Declare and initialize variables 
my $annotation = ''; 
my $dna = ''; 
my $record = ''; 
my $filename = 'record.gb'; 
my $save_input_separator = $/; 

# Open GenBank library file 
unless (open(GBFILE, $filename)) { 
    print "Cannot open GenBank file \"$filename\"\n\n"; 
    exit; 
} 

# Set input separator to "//\n" and read in a record to a scalar 
$/ = "//\n"; 

$record = <GBFILE>; 

# reset input separator 
$/ = $save_input_separator; 

# Now separate the annotation from the sequence data 
($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s); 

# Print the two pieces, which should give us the same as the 
#   original GenBank file, minus the // at the end 
print $annotation, $dna; 

exit; 


The output from this program is the same as the GenBank file listed previously, minus 
the last line, which is the end-of-record separator //. 


Let's focus on the regular expression that parses the annotation and sequence out of the 
$record variable. This is the most complicated regular expression so far: 


$record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s 

There are two pairs of parentheses in the regular expression: (LOCUS.*ORIGIN\s*\n) 
and (.*). The parentheses are metacharacters whose purpose is to remember the parts of 
the data that match the pattern within the parentheses, namely, the annotation and the 
sequence. Also note that the pattern match returns an array whose elements are the 
matched parenthetical patterns. After you match the annotation and the sequence within 
the pairs of parentheses in the regular expression, you simply assign the matched patterns 
to the two variables $annotation and $dna, like so: 

($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s); 

Notice that at the end of the pattern, we've added the /s pattern matching modifier, 
which, as you've seen earlier, allows a dot to match any character including an embedded 
newline. (Of course, since we've got a whole GenBank record in the $record scalar, 
there are a lot of embedded newlines.) 


Next, look at the first pair of parentheses: 


(LOCUS.*ORIGIN\s*\n) 


This whole expression is anchored at the beginning of the string by preceding it with a ^ 
metacharacter. (/s doesn't change the meaning of the ^ character in a regular expression.) 


Inside the parentheses, you match from where the string LOCUS appears at the beginning 
of the GenBank record, followed by any number of characters including newlines 
with .*, followed by the string ORIGIN, followed by possibly some whitespace with 
\s*, followed by a newline \n. This matches the annotation part of the GenBank record. 


Now, look at the second parentheses and the remainder: 


(.*)\/\/\n 

This is easier. The .* matches any character, including newlines because of the /s 
pattern modifier at the end of the pattern match. The parentheses are followed by the end- 
of-record line, //, including the newline at the end, with the slashes preceded by 
backslashes to show that you want to match them exactly. They're not delimiters of the 
pattern matching operator. The end result is the GenBank record with the annotation and 
the sequence separated into the variables $annotation and $dna. Although 
the regular expression I used requires a bit of explanation, the attractive thing about this 
approach is that it took only one line of Perl code to extract both annotation and sequence. 
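
To experiment with the regular expression without a datafile, you can apply it to a short string. The sketch below uses an invented miniature record; the check for a failed match and the cleanup of the sequence are additions made for this illustration.

```perl
use strict;
use warnings;

# A made-up miniature record (hypothetical data)
my $record = "LOCUS       TESTA\nDEFINITION  example.\nORIGIN\n        1 acgt\n//\n";

my ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

# A failed match returns an empty list, so check before using the pieces
if (defined $annotation) {
    $dna =~ s/[\s0-9]//g;       # strip whitespace and position numbers
    print "annotation lines: ", ($annotation =~ tr/\n//), "\n";
    print "sequence: $dna\n";
} else {
    print "Record did not match\n";
}
```

The annotation ends with the ORIGIN line, and everything between that line and the // separator lands in $dna.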


10.4 Parsing Annotations 


Now that you've successfully extracted the sequence, let's look at parsing the annotations 
of a GenBank file. 


Looking at a GenBank record, it's interesting to think about how to extract the useful 
information. The FEATURES table is certainly a key part of the story. It has considerable 
structure: what should be preserved, and what is unnecessary? For instance, sometimes 
you just want to see if a word such as "endonuclease" appears anywhere in the record. 
For this, you just need a subroutine that searches for any regular expression in the 
annotation. Sometimes this is enough, but when detailed surgery is necessary, Perl has 
the necessary tools to make the operation successful. 


10.4.1 Using Arrays 


Example 10-3 parses a few pieces of information from the annotations in a GenBank 
file. It does this using the data in the form of an array. 


Example 10-3. Parsing GenBank annotations using arrays 


#!/usr/bin/perl 
# Parsing GenBank annotations using arrays 

use strict; 
use warnings; 
use BeginPerlBioinfo;     # see Chapter 6 about this module 

# Declare and initialize variables 
my @genbank = (  ); 
my $locus = ''; 
my $accession = ''; 
my $organism = ''; 

# Get GenBank file data 
@genbank = get_file_data('record.gb'); 

# Let's start with something simple. Let's get some of the identifying 
# information, let's say the locus and accession number (here the same 
# thing) and the definition and the organism. 

for my $line (@genbank) { 
    if ($line =~ /^LOCUS/) { 
        $line =~ s/^LOCUS\s*//; 
        $locus = $line; 
    } elsif ($line =~ /^ACCESSION/) { 
        $line =~ s/^ACCESSION\s*//; 
        $accession = $line; 
    } elsif ($line =~ /^  ORGANISM/) { 
        $line =~ s/^\s*ORGANISM\s*//; 
        $organism = $line; 
    } 
} 

print "*** LOCUS ***\n"; 
print $locus; 
print "*** ACCESSION ***\n"; 
print $accession; 
print "*** ORGANISM ***\n"; 
print $organism; 

exit; 


Here's the output from Example 10-3: 
*** LOCUS *** 
AB031069     2487 bp    mRNA            PRI       27-MAY-2000 
*** ACCESSION *** 
AB031069 
*** ORGANISM *** 
Homo sapiens 

Now let's slightly extend that program to handle the DEFINITION field. Notice that the 
DEFINITION field can extend over more than one line. To collect that field, use a trick 
you've already seen in Example 10-1: set a flag when you're in the "state" of 
collecting a definition. The flag variable is called, unsurprisingly, $flag. 


Example 10-4. Parsing GenBank annotations using arrays, take 2 


#!/usr/bin/perl 
# Parsing GenBank annotations using arrays, take 2 

use strict; 
use warnings; 
use BeginPerlBioinfo;     # see Chapter 6 about this module 

# Declare and initialize variables 
my @genbank = (  ); 
my $locus = ''; 
my $accession = ''; 
my $organism = ''; 
my $definition = ''; 
my $flag = 0; 

# Get GenBank file data 
@genbank = get_file_data('record.gb'); 

# Let's start with something simple. Let's get some of the identifying 
# information, let's say the locus and accession number (here the same 
# thing) and the definition and the organism. 

for my $line (@genbank) { 
    if ($line =~ /^LOCUS/) { 
        $line =~ s/^LOCUS\s*//; 
        $locus = $line; 
    } elsif ($line =~ /^DEFINITION/) { 
        $line =~ s/^DEFINITION\s*//; 
        $definition = $line; 
        $flag = 1; 
    } elsif ($line =~ /^ACCESSION/) { 
        $line =~ s/^ACCESSION\s*//; 
        $accession = $line; 
        $flag = 0; 
    } elsif ($flag) { 
        chomp($definition); 
        $definition .= $line; 
    } elsif ($line =~ /^  ORGANISM/) { 
        $line =~ s/^\s*ORGANISM\s*//; 
        $organism = $line; 
    } 
} 

print "*** LOCUS ***\n"; 
print $locus; 
print "*** DEFINITION ***\n"; 
print $definition; 
print "*** ACCESSION ***\n"; 
print $accession; 
print "*** ORGANISM ***\n"; 
print $organism; 

exit; 


Example 10-4 outputs: 
*** LOCUS *** 
AB031069     2487 bp    mRNA            PRI       27-MAY-2000 
*** DEFINITION *** 
Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,            complete cds. 
*** ACCESSION *** 
AB031069 
*** ORGANISM *** 
Homo sapiens 





This use of flags to remember which part of the file you're in, from one iteration of a loop 
to the next, is a common technique when extracting information from files that have 
multiline sections. As the files and their fields get more complex, the code must keep 
track of many flags at a time to remember which part of the file it's in and what 
information needs to be extracted. It works, but as the files become more complex, so 
does the code. It becomes hard to read and hard to modify. So let's look at regular 
expressions as a vehicle for parsing annotations. 


10.4.2 When to Use Regular Expressions 


We've used two methods to parse GenBank files: regular expressions and looping 
through arrays of lines and setting flags. We used both methods to separate the annotation 
from the sequence in a previous section of this chapter. Both methods were equally well 
suited, since in GenBank files, the annotation is followed by the sequence, clearly 
delimited by an ORIGIN line: a simple structure. However, parsing the annotations 
seems a bit more complicated; therefore, let's try to use regular expressions to accomplish 
the task. 


To begin, let's wrap the code we've been working on into some convenient subroutines to 
focus on parsing the annotations. You'll want to fetch GenBank records one at a time 
from a library (a file containing one or more GenBank records), extract the annotations 
and the sequence, and then if desired parse the annotations. This would be useful if, say, 
you were looking for some motif in a GenBank library. Then you can search for the motif, 
and, if found, you can parse the annotations to look for additional information about the 
sequence. 


As mentioned previously, we'll use the file library.gb, which you can download from 
this book's web site. 


Since dealing with annotation data is somewhat complex, let's take a minute to break our 
tasks into convenient subroutines. Here's the pseudocode: 


sub open_file 
    given the filename, return the filehandle 

sub get_next_record 
    given the filehandle, get the record 
    (we can get the offset by first calling "tell") 

sub get_annotation_and_dna 
    given a record, split it into annotation and cleaned-up sequence 

sub search_sequence 
    given a sequence and a regular expression, 
    return array of locations of hits 

sub search_annotation 
    given a GenBank annotation and a regular expression, 
    return array of locations of hits 

sub parse_annotation 
    separate out the fields of the annotation in a convenient form 

sub parse_features 
    given the features field, separate out the components 




The idea is to make a subroutine for each important task you want to accomplish and then 
combine them into useful programs. Some of these can be combined into other 
subroutines: for instance, perhaps you want to open a file and get the record from it, all in 
one subroutine call. 


You're designing these subroutines to work with library files, that is, files with multiple 
GenBank records. You pass the filehandle into the subroutines as an argument, so that 
your subroutines can access open library files as represented by the filehandles. Doing so 
enables you to have a get_next_record function, which is handy in a loop. Using 
the Perl function tell also allows you to save the byte offset of any record of interest, and 
then return later and extract the record at that byte offset very quickly. (A byte offset is 
just the number of characters into the file where the information of interest lies.) The 
operating system supports Perl in letting you go immediately to any byte offset location 
in even huge files, thus bypassing the usual way of opening the file and reading from the 
beginning until you get where you want to be. 


Using a byte offset is important when you're dealing with large files. Perl gives you built- 
in functions such as seek that allow you, on an open file, to go immediately to any 
location in the file. The idea is that when you find something in a file, you can save the 
byte offset using the Perl function tell. Then, when you want to return to that point in the 
file, you can just call the Perl function seek with the byte offset as an argument. You'll 
see this later in this chapter when you build a DBM file to look up records based on their 
accession numbers. But the main point is that with a 250-MB file, it takes too long to find 
something by searching from the beginning, and there are ways of getting around it. 
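
The following is a minimal sketch of the tell and seek idea, assuming a tiny invented library held in a string and read through an in-memory filehandle in place of a large file on disk. The hash %offset_of and the record names are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Two made-up miniature records (hypothetical data); an in-memory filehandle
# stands in for a large library file on disk
my $library = "LOCUS       TESTA\nORIGIN\n        1 acgt\n//\n"
            . "LOCUS       TESTB\nORIGIN\n        1 ttga\n//\n";
open(my $fh, '<', \$library) or die "Cannot open in-memory file: $!";

my %offset_of;                  # record name => byte offset of its first line
{
    local $/ = "//\n";
    my $offset = tell($fh);     # note the offset BEFORE reading each record
    while (my $record = <$fh>) {
        my ($name) = ($record =~ /^LOCUS\s+(\S+)/);
        $offset_of{$name} = $offset if defined $name;
        $offset = tell($fh);
    }
}

# Jump straight back to the second record and reread it
seek($fh, $offset_of{TESTB}, 0) or die "Cannot seek: $!";
my $again;
{
    local $/ = "//\n";
    $again = <$fh>;
}
close $fh;

print "TESTB starts at byte $offset_of{TESTB}\n";
```

The first pass pays the cost of reading the whole file once; after that, any record can be fetched with a single seek and a single read.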


The parsing of the data is done in three steps, according to the design: 


1. Separate out the annotation and the sequence (which you'll clean up by removing 
   whitespace, etc., and making it a simple string of sequence). Even at this step, you can 
   search for motifs in the sequence, as well as look for text in the annotation. 
2. Extract out the fields. 
3. Parse the features table. 


These steps seem natural, and, depending on what you want to do, allow you to parse to 
whatever depth is needed. 


Here's a main program in pseudocode that shows how to use those subroutines: 

open_file 

while ( get_next_record ) { 

    get_annotation_and_dna 

    if ( search_sequence for a motif AND 
         search_annotation for chromosome 22 ) { 

        parse_annotation 

        parse_features to get sizes of exons, look for small sizes 
    } 
} 

return accession numbers of records meeting the criteria 

This example shows how to use subroutines to answer a question such as: what are the 
genes on chromosome 22 that contain a given motif and have small exons? 


10.4.3 Main Program 


Let's test these subroutines with Example 10-5, which has some subroutine definitions 
that will be added to the BeginPerlBioinfo.pm module: 


Example 10-5. GenBank library subroutines 


#!/usr/bin/perl 
# - test program of GenBank library subroutines 

use strict; 
use warnings; 
# Don't use BeginPerlBioinfo 
# Since all subroutines defined in this file 
# use BeginPerlBioinfo;     # see Chapter 6 about this module 

# Declare and initialize variables 
my $fh;     # variable to store filehandle 
my $record; 
my $dna; 
my $annotation; 
my $offset; 
my $library = 'library.gb'; 

# Perform some standard subroutines for test 
$fh = open_file($library); 

$offset = tell($fh); 

while( $record = get_next_record($fh) ) { 

    ($annotation, $dna) = get_annotation_and_dna($record); 

    if( search_sequence($dna, 'AAA[CG].')) { 
        print "Sequence found in record at offset $offset\n"; 
    } 
    if( search_annotation($annotation, 'homo sapiens')) { 
        print "Annotation found in record at offset $offset\n"; 
    } 

    $offset = tell($fh); 
} 

exit; 

########################################################################
# Subroutines 
########################################################################

# open_file 
# 
#   - given filename, set filehandle 

sub open_file { 

    my($filename) = @_; 
    my $fh; 

    unless(open($fh, $filename)) { 
        print "Cannot open file $filename\n"; 
        exit; 
    } 
    return $fh; 
} 

# get_next_record 
# 
#   - given filehandle, get the next GenBank record 

sub get_next_record { 

    my($fh) = @_; 

    my($record) = ''; 
    my($save_input_separator) = $/; 

    $/ = "//\n"; 

    $record = <$fh>; 

    $/ = $save_input_separator; 

    return $record; 
} 

# get_annotation_and_dna 
# 
#   - given GenBank record, get annotation and DNA 

sub get_annotation_and_dna { 

    my($record) = @_; 

    my($annotation) = ''; 
    my($dna) = ''; 

    # Now separate the annotation from the sequence data 
    ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s); 

    # Clean the sequence of any whitespace, / characters, or line numbers 
    #  (the / has to be written \/ in the character class, because 
    #   / is a metacharacter, so it must be "escaped" with \) 
    $dna =~ s/[\s\/0-9]//g; 

    return($annotation, $dna) 
} 

# search_sequence 
# 
#   - search sequence with regular expression 

sub search_sequence { 

    my($sequence, $regularexpression) = @_; 

    my(@locations) = (  ); 

    while( $sequence =~ /$regularexpression/ig ) { 
        # pos needs the variable name; with no argument it refers to $_ 
        push( @locations, pos($sequence) ); 
    } 

    return (@locations); 
} 

# search_annotation 
# 
#   - search annotation with regular expression 

sub search_annotation { 

    my($annotation, $regularexpression) = @_; 

    my(@locations) = (  ); 

    # note the /s modifier--. matches any character including newline 
    while( $annotation =~ /$regularexpression/isg ) { 
        push( @locations, pos($annotation) ); 
    } 

    return (@locations); 
} 

Example 10-5 generates the following output on our little GenBank library: 









































Sequence found in record at offset 0 
Annotation found in record at offset 0 
Sequence found in record at offset 6256 
Annotation found in record at offset 6256 
Sequence found in record at offset 12366 
Annotation found in record at offset 12366 
Sequence found in record at offset 17730 
Annotation found in record at offset 17730 
Sequence found in record at offset 22340 
Annotation found in record at offset 22340 
































The tell function reports the byte offset of the file up to the point where it's been read; so 
you want to first call tell and then read the record to get the proper offset associated with 
the beginning of the record. 


10.4.4 Parsing Annotations at the Top Level 


Now let's parse the annotations. 


There is a document from NCBI we mentioned earlier that gives the details of the 
structure of a GenBank record. This file is gbrel.txt and is part of the GenBank release, 
available at the NCBI web site and their FTP site. It's updated with every release (every 
two months at present), and it includes notices of changes to the format. If you program 
with GenBank records, you should read this document and keep a copy around for 
reference use, and check periodically for announced changes in the GenBank record 
format. 


If you look back at the complete GenBank record earlier in this chapter, you'll see that the 
annotations have a certain structure. You have some fields, such as LOCUS, 
DEFINITION, ACCESSION, VERSION, KEYWORDS, SOURCE, REFERENCE, 
FEATURES, and BASE COUNT that start at the beginning of a line. Some of these 
fields have subfields, especially the FEATURES field, which has a fairly complex 
structure. 


But for now, let's just extract the top-level fields. You will need a regular expression that 
matches everything from a word at the beginning of a line to a newline that just precedes 
another word at the beginning of a line. 


Here's a regular expression that matches our definition of a field: 


/^[A-Z].*\n(^\s.*\n)*/m 

What does this regular expression say? First of all, it has the /m pattern matching 
modifier, which means the caret ^ and the dollar sign $ also match around embedded 
newlines (not just at the beginning and end of the entire string, which is the default 
behavior). 


The first part of the regular expression: 

^[A-Z].*\n 


matches a capital letter at the beginning of a line, followed by any number of characters 
(except newlines), followed by a newline. That's a good description of the first lines of 
the fields you're trying to match. 


The second part of the regular expression: 


(^\s.*\n)* 

matches a space or tab \s at the beginning of a line, followed by any number of 
characters (except newlines), followed by a newline. This is surrounded by parentheses 
and followed by a *, which means 0 or more such lines. This matches succeeding lines in 
a field, lines that start with whitespace. A field may have no extra lines of this sort or 
several such lines. 


So, the two parts of the regular expression combined match the fields with their optional 
additional lines. 
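
Here is a small sketch that tries the regular expression on a short, invented annotation fragment. It uses a match in list context to collect all the fields at once, and it writes the inner group as a non-capturing (?:...) group so that each returned element is a whole field; both choices are assumptions of this sketch and differ from the subroutine that follows.

```perl
use strict;
use warnings;

# A made-up annotation fragment (hypothetical data)
my $annotation = "LOCUS       TESTA\n"
               . "DEFINITION  first line of the definition,\n"
               . "            second line.\n"
               . "ACCESSION   TESTA\n";

# In list context, /g returns every match; the (?:...) group does not capture,
# so each element is one whole field
my @fields = ($annotation =~ /^[A-Z].*\n(?:^\s.*\n)*/gm);

print scalar(@fields), " fields\n";
print $fields[1];
```

The second element holds both lines of the DEFINITION field, because its continuation line begins with whitespace.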


Example 10-6 shows a subroutine that, given the annotations section of a GenBank 
record stored in a scalar variable, returns a hash with keys equal to the names of the top- 
level fields and values equal to the contents of those fields. 


Example 10-6. Parsing GenBank annotation 


#!/usr/bin/perl 
# - test program for parse_annotation subroutine 

use strict; 
use warnings; 
use BeginPerlBioinfo;     # see Chapter 6 about this module 

# Declare and initialize variables 
my $fh; 
my $record; 
my $dna; 
my $annotation; 
my %fields; 
my $library = 'library.gb'; 

# Open library and read a record 
$fh = open_file($library); 

$record = get_next_record($fh); 

# Parse the sequence and annotation 
($annotation, $dna) = get_annotation_and_dna($record); 

# Extract the fields of the annotation 
%fields = parse_annotation($annotation); 

# Print the fields 
foreach my $key (keys %fields) { 
    print "******** $key *********\n"; 
    print $fields{$key}; 
} 

exit; 

########################################################################
# Subroutine 
########################################################################

# parse_annotation 
# 
#  given a GenBank annotation, returns a hash with 
#   keys: the field names 
#   values: the fields 

sub parse_annotation { 

    my($annotation) = @_; 
    my(%results) = (  ); 

    while( $annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) { 
        my $value = $&; 
        (my $key = $value) =~ s/^([A-Z]+).*/$1/s; 
        $results{$key} = $value; 
    } 

    return %results; 
} 

In the subroutine parse_annotation, note how the variables $key and $value are 
scoped within the while block. One benefit of this is that you don't have to reinitialize 
the variables each time through the loop. Also note that the key is the name of the field, 
and the value is the whole field. 


You should take the time to understand the regular expression that extracts the field name 
for the key: 


(my $key = $value) =~ s/^([A-Z]+).*/$1/s;


This first assigns $key the value $value. It then replaces everything in $key (note the
/s modifier for embedded newlines) with $1, a special variable that holds the text matched
by the pattern between the first pair of parentheses, ([A-Z]+). This pattern is one or
more capital letters (anchored to the beginning of the string, i.e., the field name), so it
sets $key to the first word of the field name.
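

To see the copy-then-substitute idiom in isolation, here is a small sketch that runs it on a made-up two-line field (the field text is invented for illustration).

```perl
use strict;
use warnings;

my $value = "DEFINITION  Hypothetical example record,\n"
          . "            continued on a second line.\n";

# Copy $value into $key, then edit only the copy.
(my $key = $value) =~ s/^([A-Z]+).*/$1/s;

print "key:   $key\n";
print "value: $value";
```

Only $key is changed; $value still holds the whole field. Note also that [A-Z]+ stops at the first character that isn't a capital letter, so a field name of two words, such as BASE COUNT, yields the key BASE.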


You get the following output from Example 10-6 (the test just fetches the first record 
in the GenBank library): 


******** SOURCE *********
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
******** DEFINITION *********
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
******** KEYWORDS *********
KEYWORDS    .
******** VERSION *********
VERSION     AB031069.1  GI:8100074
******** FEATURES *********
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
/translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD 











NCNEWFHGDCIRITEKMAKATREWYCRECREKDPKLETRYRHKKSRERDGNERDSSEP 


RDEGGGRKRPVPDPDLORRAGSGTGVGAMLARGSAS PHKSS POPLVAT PSQHHOQOOO 


QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIROKCRLROCQLRARESYKY 


FPSSLSPVTPSESLPRPRRPLPTQOOQPOPSOKLGRIREDEGAVASSTVKEPPEATATP 


BEPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEES PFLDPALRKRAVKVKHVKRRE 


KKSEKKKEERYKRHROKOKHKDKWKH PERADAKDPAS LPOCLGPGCVRPAQPSSKYCS 


DDCGMKLAANRIYETLPORTQOWOQOS PC IAREHGKKLLERIRREQOSARTRLOEMERR 


FHELEAT TLRAKOQAVREDEESNEGDSDDTDLOTFCVSCGHPINPRVALRHMERCYAK 


YESOTSFGSMY PTRIEGATRLFCDVYNPOSKTYCKRLOVLC PEHSRDPKVPADEVCGC 





PLVRDVFELTGDFCRLPKROCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT 
AMTNRAGLLALMLHOTIQHDPLTTDLRSSADR" 
******** REFERENCE *********
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp,
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
******** ACCESSION *********
ACCESSION   AB031069
******** LOCUS *********
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
******** ORIGIN *********
ORIGIN
******** BASE *********
BASE COUNT      564 a    715 c    768 g    440 t


As you see, the method is working, and apart from the difficulty of reading the regular 
expressions (which will become easier with practice), the code is very straightforward, 
just a few short subroutines. 


10.4.5 Parsing the FEATURES Table 


Let's take this one step further and parse the features table to its next level, composed of
the source, gene, and CDS feature keys. (See later in this section for a more
complete list of these feature keys.) In the exercises at the end of the chapter, you'll be
challenged to descend further into the FEATURES table.


To study the FEATURES table, you should first look over the NCBI gbrel.txt document 
mentioned previously. Then you should study the most complete documentation for the 
FEATURES table, available at 


http://www.ncbi.nlm.nih.gov/collab/FT/index.html.


10.4.5.1 Features 


Although our GenBank entry is fairly simple and includes only three features, there are 
actually quite a few of them. Notice that the parsing code will find all of them, because 
it's just looking at the structure of the document, not for specific features. 


The following is a list of the features defined for GenBank records. Although lengthy, I 
think it's important to read through it to get an idea of the range of information that may 
be present in a GenBank record. 
allele 

Obsolete; see variation feature key 
attenuator 

Sequence related to transcription termination 
C_region 


Span of the C immunological feature
CAAT_signal

CAAT box in eukaryotic promoters 
CDS

Sequence coding for amino acids in protein (includes stop codon) 
conflict 

Independent sequence determinations differ 
D-loop 

Displacement loop 
D_segment 

Span of the D immunological feature 
enhancer 

Cis-acting enhancer of promoter function 
exon 

Region that codes for part of spliced mRNA 
gene 


Region that defines a functional gene, possibly including upstream (promoter, 
enhancer, etc.) and downstream control elements, and for which a name has been 
assigned 


GC_signal 

GC box in eukaryotic promoters 
iDNA 

Intervening DNA eliminated by recombination 
intron 

Transcribed region excised by mRNA splicing 
J_region 

Span of the J immunological feature 
LTR 

Long terminal repeat 
mat_peptide 

Mature peptide coding region (doesn't include stop codon) 
misc_binding 

Miscellaneous binding site 


misc_difference 




Miscellaneous difference feature 
misc_feature 

Region of biological significance that can't be described by any other feature 
misc_recomb 

Miscellaneous recombination feature 
misc_RNA 

Miscellaneous transcript feature not defined by other RNA keys 
misc_signal 

Miscellaneous signal 
misc_structure 

Miscellaneous DNA or RNA structure 
modified_base 

The indicated base is a modified nucleotide 
mRNA 

Messenger RNA 
mutation 

Obsolete: see variation feature key 
N_region 

Span of the N immunological feature 
old_sequence 

Presented sequence revises a previous version 
polyA_signal 

Signal for cleavage and polyadenylation 
polyA_site 

Site at which polyadenine is added to mRNA
precursor_RNA 

Any RNA species that isn't yet the mature RNA product 
prim_transcript 

Primary (unprocessed) transcript 
primer 

Primer binding region used with PCR 


primer_bind

Noncovalent primer binding site
promoter 

A region involved in transcription initiation 
protein_bind 

Noncovalent protein binding site on DNA or RNA 
RBS 

Ribosome binding site 
rep_origin 

Replication origin for duplex DNA 
repeat_region 

Sequence containing repeated subsequences 
repeat_unit 

One repeated unit of a repeat_region 
rRNA 

Ribosomal RNA 
S_region 

Span of the S immunological feature 
satellite 

Satellite repeated sequence 
scRNA

Small cytoplasmic RNA 
sig_peptide 

Signal peptide coding region 
snRNA 

Small nuclear RNA 
source


Biological source of the sequence data represented by a GenBank record; 
mandatory feature, one or more per record; for organisms that have been 
incorporated within the NCBI taxonomy database, an associated 
/db_xref="taxon:NNNNN" qualifier will be present (where NNNNN is the
numeric identifier assigned to the organism within the NCBI taxonomy database) 


stem_loop 
Hairpin loop structure in DNA or RNA
STS


Sequence Tagged Site: operationally unique sequence that identifies the 
combination of primer spans used in a PCR assay 


TATA_signal 

TATA box in eukaryotic promoters 
terminator 

Sequence causing transcription termination 
transit_peptide 

Transit peptide coding region 
transposon 

Transposable element (TN) 
tRNA 

Transfer RNA 
unsure 

Authors are unsure about the sequence in this region 
V_region 

Span of the V immunological feature 
variation 


A related population contains stable mutation
-
Placeholder (hyphen)
-10_signal 
Pribnow box in prokaryotic promoters 
-35_signal 
-35 box in prokaryotic promoters 
3'clip 
3'-most region of a precursor transcript removed in processing 
3'UTR 
3' untranslated region (trailer) 
5'clip 
5'-most region of a precursor transcript removed in processing 


5'UTR
5' untranslated region (leader)


These feature keys can have their own additional features, which you'll see here and in 
the exercises. 


10.4.5.2 Parsing 


Example 10-7 finds whatever features are present and returns an array populated with 
them. It doesn't look for the complete list of features as presented in the last section; it 
finds just the features that are actually present in the GenBank record and returns them 
for further use. 


It's often the case that there are multiple instances of the same feature in a record. For 
instance, there may be several exons specified in the FEATURES table of a GenBank 
record. For this reason we'll store the features as elements in an array, rather than in a
hash keyed on the feature name (a hash keyed that way could hold only one exon, for
instance, because each new exon would overwrite the previous one).
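

The difference is easy to demonstrate. In this small sketch, the name/text pairs are hypothetical stand-ins for parsed features; storing them in a hash keyed on the name loses one of the two exons, while the array keeps all three features.

```perl
use strict;
use warnings;

# Hypothetical (name, text) pairs standing in for parsed features.
my @parsed = (
    [ 'exon', "exon 1..100\n"   ],
    [ 'exon', "exon 201..300\n" ],
    [ 'CDS',  "CDS 1..300\n"    ],
);

my %by_name;    # hash keyed on the feature name
my @in_order;   # array of features

foreach my $pair (@parsed) {
    my ($name, $text) = @$pair;
    $by_name{$name} = $text;     # a second exon overwrites the first
    push @in_order, $text;       # every feature is kept
}

printf "hash holds %d features, array holds %d\n",
    scalar(keys %by_name), scalar(@in_order);
```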


Example 10-7. Testing subroutine parse_features 


#!/usr/bin/perl
# - main program to test parse_features

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my @features;
my $library = 'library.gb';

# Get the fields from the first GenBank record in a library
$fh = open_file($library);

$record = get_next_record($fh);

($annotation, $dna) = get_annotation_and_dna($record);

%fields = parse_annotation($annotation);

# Extract the features from the FEATURES table
@features = parse_features($fields{'FEATURES'});

# Print out the features
foreach my $feature (@features) {

    # extract the name of the feature (or "feature key")
    my($featurename) = ($feature =~ /^ {5}(\S+)/);

    print "******** $featurename *********\n";
    print $feature;
}

exit;

################################################################################
# Subroutine
################################################################################

# parse_features
#
#  extract the features from the FEATURES field of a GenBank record

sub parse_features {

    my($features) = @_;   # entire FEATURES field in a scalar variable

    # Declare and initialize variables
    my(@features) = (  );   # used to store the individual features

    # Extract the features
    while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {

        my $feature = $&;
        push(@features, $feature);

    }

    return @features;
}


Example 10-7 gives the output: 


******** source *********
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
******** gene *********
     gene            229..2199
                     /gene="PCCX1"
******** CDS *********
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
/translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD 








NCNEWFHGDCIRITEKMAKATREWYCRECREKDPKLETRYRHKKSRERDGNERDSSEP 








RDEGGGRKRPVPDPDLORRAGSGTGVGAMLARGSAS PHKSS POPLVAT PSQHHOQOOO 
QI KRSARMCGECEACRRTEDCGHCDF'CRDMKKFGGPNKIROQKCRLROCOQLRARESYKY 
FPSSLSPVTPSESLPRPRRPLPTQOOQPOPSOKLGRIREDEGAVASSTVKEPPEATATP 
EPLSDEDLPLDPDLYQDFCAGAF DDHGLPWMSDTEES PFLDPALRKRAVKVKHVKRRE 
KKSEKKKEERYKRHROKOKHKDKWKHPERADAKDPASLPOCLGPGCVRPAQPSSKYCS 
DDCGMKLAANRIYETLPORTQOWQOOS PC IAREHGKKLLERIRREQOSARTRLOEMERR 
FHELEAT TLRAKOQAVREDEESNEGDSDDTDLOTFCVSCGHPINPRVALRHMERCYAK 
YESOTSFGSMY PTRIEGATRLFE'CDVYNPOSKTYCKRLOVLC PEHSRDPKVPADEVCGC 


PLVRDVFELTGDFCRLPKROCNRHY CWEKLRRAEVDLERVRVWYKLDELFEQERNVRT 
AMTNRAGLLALMLHOTIQHDPLTTDLRSSADR 

In subroutine parse_features of Example 10-7, the regular expression that
extracts the features is much like the regular expression used in Example 10-6 to parse
the top level of the annotations. Let's look at the essential parsing code of Example 10-7:


while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {


On the whole, and in brief, this regular expression finds features formatted with the first
lines beginning with 5 spaces, and optional continuation lines beginning with 21 spaces.


First, note that the pattern modifier /m enables the ^ metacharacter to match after
embedded newlines. Also, the {5} and {21} are quantifiers that specify there should be
exactly 5, or 21, of the preceding item, which in both cases is a space.


The regular expression is in two parts, corresponding to the first line and optional
continuation lines of the feature. The first part, ^ {5}\S.*\n, means that the beginning
of a line (^) has 5 spaces ( {5}), followed by a non-whitespace character (\S) followed
by any number of non-newlines (.*) followed by a newline (\n). The second part of the
regular expression, (^ {21}\S.*\n)*, means the beginning of a line (^) has 21 spaces
( {21}) followed by a non-whitespace character (\S) followed by any number of
non-newlines (.*) followed by a newline (\n); and there may be 0 or more such lines,
indicated by the ()* around the whole expression.


The main program has a short regular expression along similar lines to extract the feature 
name (also called the feature key) from the feature. 
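

Both regular expressions can be tried out without a GenBank library. The following sketch applies them to a cut-down FEATURES field typed in by hand; it assumes only the column layout described above (feature keys after 5 spaces, continuation lines after 21).

```perl
use strict;
use warnings;

# A cut-down FEATURES field; the column layout follows the GenBank format.
my $features = join '',
    "FEATURES             Location/Qualifiers\n",
    "     source          1..2487\n",
    "                     /organism=\"Homo sapiens\"\n",
    "     gene            229..2199\n",
    "                     /gene=\"PCCX1\"\n";

# Same pattern as in parse_features
my @features;
while ( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {
    push @features, $&;
}

foreach my $feature (@features) {
    # Same pattern as in the main program: the feature key follows 5 spaces
    my ($name) = ( $feature =~ /^ {5}(\S+)/ );
    my $lines  = ( $feature =~ tr/\n// );    # count the lines in this feature
    print "$name: $lines line(s)\n";
}
```

The FEATURES header line is skipped because it doesn't begin with 5 spaces, and a continuation line can't be mistaken for a new feature because its sixth character is a space rather than a non-whitespace character.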


So, again, success. The FEATURES table is now decomposed or "parsed" in some detail, 
down to the level of separating the individual features. The next stage in parsing the 
FEATURES table is to extract the detailed information for each feature. This includes the 
location (given on the same line as the feature name, and possibly on additional lines); 
and the qualifiers indicated by a slash, a qualifier name, and if applicable, an equals sign 
and additional information of various kinds, possibly continued on additional lines. 


I'll leave this final step for the exercises. It's a fairly straightforward extension of the 
approach we've been using to parse the features. You will want to consult the 
documentation from the NCBI web site for complete details about the structure of the 
FEATURES table before trying to parse the location and qualifiers from a feature. 
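

As a starting point only, here is one possible sketch of that next step. The subroutine parse_feature_sketch is a hypothetical helper, not part of BeginPerlBioinfo, and it rests on simplifying assumptions that the NCBI documentation may not bear out in every record: every qualifier starts with a slash at the beginning of a continuation line, and any other continuation line belongs to the qualifier before it (or to the location, if no qualifier has been seen yet).

```perl
use strict;
use warnings;

# parse_feature_sketch is a hypothetical helper, not part of BeginPerlBioinfo.
sub parse_feature_sketch {
    my ($feature) = @_;

    # Feature key, then the rest of the first line as the start of the location
    my ($name, $location) = ( $feature =~ /^ {5}(\S+)\s+(.*)\n/ );
    my @qualifiers;

    # Walk the continuation lines (21 spaces of indent)
    while ( $feature =~ /^ {21}(.*)\n/gm ) {
        my $text = $1;
        if ( $text =~ m{^/} ) {
            push @qualifiers, $text;          # a new qualifier
        } elsif (@qualifiers) {
            $qualifiers[-1] .= " $text";      # continuation of a qualifier
        } else {
            $location .= $text;               # continuation of the location
        }
    }
    return ($name, $location, @qualifiers);
}

my $feature = join '',
    "     CDS             229..2199\n",
    "                     /gene=\"PCCX1\"\n",
    "                     /note=\"a nuclear protein carrying a PHD finger and a CXXC\n",
    "                     domain\"\n",
    "                     /codon_start=1\n";

my ($name, $location, @qualifiers) = parse_feature_sketch($feature);
print "name: $name\nlocation: $location\n";
print "qualifier: $_\n" for @qualifiers;
```

Joining continuation lines with a space suits free text such as /note, but it would insert unwanted spaces into a wrapped /translation sequence, and a wrapped line of text that happens to begin with a slash would be mistaken for a new qualifier. A complete solution to the exercise has to deal with such cases.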


The method I've used to parse the FEATURES table maintains the structure of the 
information. However, sometimes you just want to see if some word such as
"endonuclease" appears anywhere in the record. For this, recall that you created a
search_annotation subroutine in Example 10-5 that searches for any regular
expression in the entire annotation; very often, this is all you really need. As you've now 
seen, however, when you really need to dig into the FEATURES table, Perl has its own 
features that make the job possible and even fairly easy. 


10.5 Indexing GenBank with DBM 


DBM stands for Database Management. Perl provides a set of built-in functions that give 
Perl programmers access to DBM files. 


10.5.1 DBM Essentials 


When you open a DBM file, you access it like a hash: you give it keys and it returns
values, and you can add and delete key-value pairs. What's useful about DBM is that it
saves the key-value data in a permanent disk file on your computer. It can thus save 
information between the times you run your program; it can also serve as a way to share 
information between different programs that need the same data. A DBM file can get 
very big without killing the main memory on your computer and making your program— 
and everything else—slow to a crawl. 


There are two functions, dbmopen and dbmclose, that "tie" a hash to a DBM file;
then you just use the hash. As you've seen, with a hash, lookups are easy, as are
definitions. You can get a list of all the keys from a hash called %my_hash by typing
keys %my_hash. You then can get a list of all values by typing values %my_hash.
For large DBM files, you may not want to do this; the Perl function each allows you to
read key-value pairs one at a time, thus saving the memory of your running program.
There is also a delete function to remove the definitions of keys:


delete $my_hash{'DNA'}


entirely removes that key from the hash.
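

The following sketch puts these pieces together. To avoid touching any real files, it creates its DBM file in a temporary directory using the standard File::Temp module; the file name demo and the three key-value pairs are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Work in a throwaway directory so no real files are touched.
my $dir = tempdir( CLEANUP => 1 );
my %my_hash;

dbmopen( %my_hash, "$dir/demo", 0644 )
    or die "Cannot open DBM file: $!";

# Ordinary hash assignments are written to the DBM file
$my_hash{'DNA'}     = 'ACGT';
$my_hash{'RNA'}     = 'ACGU';
$my_hash{'protein'} = 'MEGD';

delete $my_hash{'DNA'};          # removes the key entirely

# each returns one key-value pair at a time
my @seen;
while ( my ($key, $value) = each %my_hash ) {
    push @seen, "$key=$value";
}
print join( "\n", sort @seen ), "\n";

dbmclose(%my_hash);
```

The pairs are collected and sorted before printing only because a hash, tied or not, returns its keys in no particular order.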


DBM files are a very simple database. They don't have the power of a relational database 
such as MySQL, Oracle, or PostgreSQL; however, it's remarkable how often a
simple database is all that a problem really needs. When you have a set of key-value data 
(or several such sets), consider using DBM. It's certainly easy to use with Perl. 


The main wrinkle to using DBM is that there are several, slightly different DBM 
implementations—NDBM, GDBM, SDBM, and Berkeley DB. The differences are small 
but real; but for most purposes, the implementations are interchangeable. Newer versions 
of Perl give you Berkeley DB by default, and it's easy to get it and install it for your Perl 
if you want. If you don't have really long keys or values, it's not a problem. Some older 
DBMs require you to add null bytes to keys and delete them from values: 


$value = $my_hash{"$key\0"};
chop $value;


Chances are good that you won't have to do that. Berkeley DB handles long strings well 
(some of the other DBM implementations have limits), and because you have some 
potentially long strings in biology, I recommend installing Berkeley DB if you don't have 
it. 


10.5.2 A DBM Database for GenBank 


You've seen how to extract information from a GenBank record or from a library of 
GenBank records. You've just seen how DBM files can save your hash data on your hard 
disk between program runs. You've also seen the use of tell and seek to quickly 
access a location in a file. 


Now let's combine the three ideas and use DBM to build a database of information about 
a GenBank library. It'll be something simple: you'll extract the accession numbers for the
keys and store the byte offsets in the GenBank library of records for the values. You'll
add some code that, given a library and an offset, returns the record at that offset, and 
write a main program that allows the user to interactively request GenBank records by 
accession number. When complete, your program should very quickly return a GenBank 
record if given its accession number. 


This general idea is extended in the exercises at the end of the chapter to a considerable 
extent; you may want to glance ahead at them now to get an idea of the potential power 
of the technique I'm about to present. 
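

Before bringing in DBM, it may help to see the offset idea on its own. This sketch is an illustration under stated assumptions, not the program developed below: it reads two made-up records from an in-memory filehandle instead of a library file, uses an ordinary hash where the DBM hash will go, and relies on each record ending with a line containing only //.

```perl
use strict;
use warnings;

# Two made-up records in a GenBank-like layout, each ending with "//".
my $library = join '',
    "LOCUS       XX000001\n", "ACCESSION   XX000001\n", "//\n",
    "LOCUS       XX000002\n", "ACCESSION   XX000002\n", "//\n";

# Read the string through a filehandle, as if it were a library file.
open( my $fh, '<', \$library ) or die "Cannot open in-memory file: $!";

my %index;                       # accession number => byte offset
{
    local $/ = "//\n";           # read one whole record per <$fh>
    my $offset = tell($fh);
    while ( my $record = <$fh> ) {
        if ( $record =~ /^ACCESSION\s+(\S+)/m ) {
            $index{$1} = $offset;
        }
        $offset = tell($fh);     # where the next record starts
    }
}

# Jump straight to the second record using the saved offset.
my $wanted = 'XX000002';
my $found;
if ( defined $index{$wanted} ) {
    local $/ = "//\n";
    seek( $fh, $index{$wanted}, 0 );
    $found = <$fh>;
    print $found;
}
close($fh);
```

The test uses defined rather than the truth of the offset, because the first record in a library sits at offset 0, which Perl treats as false.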


With just the appropriate amount of further ado, here is a code fragment that opens 
(creating if necessary) a DBM file: 


unless(dbmopen(%my_hash, 'DBNAME', 0644)) {
    print "Cannot open DBM file DBNAME with mode 0644\n";
    exit;
}


%my_hash is like any other hash in Perl, but it will be tied to the DBM file with this
statement. DBNAME is the basename of the actual DBM files that will be created. Some
DBM versions create one file of exactly that name; others create two files with file
extensions .dir and .pag.


Another parameter is called the mode. Unix or Linux users will be familiar with file 
permissions in this form. Many possibilities exist; here are the most common ones: 
0644 
You can read and write; others can just read. 
0600 
Only you can read or write. 
0666 
Anyone can read or write. 
0444 
Anyone can read (nobody can write). 
0400 
Only you can read (nobody else can do anything). 
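

If the numbers look arbitrary, it may help to see how they break down. Each of the last three octal digits describes one class of user (owner, group, others), and within a digit 4 means read, 2 means write, and 1 means execute. The helper mode_to_string below is hypothetical, written only to illustrate the arithmetic.

```perl
use strict;
use warnings;

# mode_to_string is a hypothetical helper for illustration only.
sub mode_to_string {
    my ($mode) = @_;
    my $string = '';
    # Owner, group, others: three bits each, highest first
    foreach my $shift (6, 3, 0) {
        my $bits = ( $mode >> $shift ) & 7;
        $string .= ( $bits & 4 ) ? 'r' : '-';
        $string .= ( $bits & 2 ) ? 'w' : '-';
        $string .= ( $bits & 1 ) ? 'x' : '-';
    }
    return $string;
}

foreach my $mode (0644, 0600, 0666, 0444, 0400) {
    printf "%04o  %s\n", $mode, mode_to_string($mode);
}
```

The leading 0 matters: it tells Perl that the number is written in octal, so 0644 and 644 are different values.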


The dbmopen call fails if you try to open a file with a mode that assumes there are more
permissions than were conferred on the DBM file when it was created. Usually, the mode
0644 is declared by the owner if only the owner should be allowed to write, and 0444 is
declared by readers. Mode 0666 is declared by the owner and others if the file is meant to
be read or written by anyone.


That's pretty much it; DBM files are that simple. Example 10-8 builds a DBM file
whose keys are the accession numbers of GenBank records and whose values are the byte
offsets of those records in the library.


Example 10-8. A DBM index of a GenBank library 


#!/usr/bin/perl
# - make a DBM index of a GenBank library,
#     and demonstrate its use interactively

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my %dbm;
my $answer;
my $offset;
my $library = 'library.gb';

# open DBM file, creating if necessary
unless(dbmopen(%dbm, 'GB', 0644)) {
    print "Cannot open DBM file GB with mode 0644\n";
    exit;
}

# Parse GenBank library, saving accession number and offset in DBM file
$fh = open_file($library);

$offset = tell($fh);

while ( $record = get_next_record($fh) ) {

    # Get accession field for this record.
    ($annotation, $dna) = get_annotation_and_dna($record);

    %fields = parse_annotation($annotation);

    my $accession = $fields{'ACCESSION'};

    # extract just the accession number from the accession field
    # --remove any trailing spaces
    $accession =~ s/^ACCESSION\s*//;

    $accession =~ s/\s*$//;

    # Store the key/value of accession/offset
    $dbm{$accession} = $offset;

    # get offset for next record
    $offset = tell($fh);
}

# Now interactively query the DBM database with accession numbers
#  to see associated records

print "Here are the available accession numbers:\n";

print join ( "\n", keys %dbm ), "\n";

print "Enter accession number (or quit): ";

while( $answer = <STDIN> ) {
    chomp $answer;
    if($answer =~ /^\s*q/) {
        last;
    }
    $offset = $dbm{$answer};

    # Test with defined: the first record is at offset 0, which is false
    if (defined $offset) {
        seek($fh, $offset, 0);
        $record = get_next_record($fh);
        print $record;
    }else{
        print "Do not have an entry for accession number $answer\n";
    }

    print "\nEnter accession number (or quit): ";
}

dbmclose(%dbm);

close($fh);

exit;


Here's the truncated output of Example 10-8: 

Here are the available accession numbers:
XM_006271
NM_021964
XM_009873
AB031069
XM_006269
Enter accession number (or quit): NM_021964
LOCUS       NM_021964    3032 bp    mRNA            PRI       14-MAR-2001
DEFINITION  Homo sapiens zinc finger protein 148 (pHZ-52) (ZNF148), mRNA.
...
//

Enter accession number (or quit): q


10.6 Exercises 


Exercise 10.1 

Go to the NCBI, EMBL, and EBI web sites and become familiar with their use. 
Exercise 10.2 

Read the GenBank format documentation, gbrel.txt.
Exercise 10.3 


Write a subroutine that passes a hash by value. Now rewrite it to pass the hash by 
reference. 


Exercise 10.4 


Design a module of subroutines to handle the following kinds of data: a flat file 
containing records consisting of gene names on a line and extra information of 
any sort on succeeding lines, followed by a blank line. Your subroutines should 
be able to read in the data and then do a fast lookup on the information associated 
with a gene name. You should also be able to add new records to the flat file. 
Now reuse this module to build an address book program. 


Exercise 10.5 


Descend further into the FEATURES table. Parse the features in the table into 
their next level by parsing the feature names, locations, and qualifiers. Check the 
document gbrel.txt for definitions of the structures of the fields. 

Exercise 10.6 


Write a program that takes a long DNA sequence as input and outputs the counts
of all four-base subsequences (256 of them in all), sorted by frequency. A four-base
subsequence starts at each location 1, 2, 3, and so on. (This kind of word-frequency
analysis is common to many fields of study, including linguistics,
computer science, and music.)


Exercise 10.7 


Extend the program in Exercise 10.6 to count all the sequences in a GenBank 
library. 


Exercise 10.8 


Given an amino acid, find the frequency of occurrence of the adjacent amino 
acids coded in a DNA sequence; or in a GenBank library. 


Exercise 10.9


Extract all the words (excluding words like "the" or other unnecessary words) 
from the annotation of a library of GenBank records. For each word found, add 
the offset of the GenBank record in the library to a DBM file that has keys equal 
to the words, and values that are strings with offsets separated by spaces. In other 
words, one key can have a space-separated list of offsets for a value. Then you 
can quickly find all records containing a word like "fibroblast" with a simple 
lookup, followed by extracting the offsets and seeking into the library with those 
offsets. How big is your DBM file compared to the GenBank library? What might 
be involved in constructing a search engine for the annotations in all of GenBank? 
For human DNA only? 


Exercise 10.10 


Write a program to make a custom library of oncogenes from the GBPRI division 
of GenBank.


Chapter 11. Protein Data Bank


The success of the Human Genome Project in decoding the DNA sequence of human 
genes has captured the public imagination, but another project has been quietly gaining 
momentum, and it promises equally revolutionary results. This project is an international 
effort to determine the 3D structure of a comprehensive range of proteins on a genome- 
wide level using high-throughput analytical technologies. This international effort is the 
foundation of the new field of structural genomics. 


Recent and expected advances in technology promise an accelerating pace of protein 
structure determination. The storehouse for all of this data is the Protein Data Bank 
(PDB). The PDB may be found on the web at http://www.rcsb.or 


Finding the amino acid or primary sequence is just the beginning of studying a protein. 
Proteins fold locally into secondary structures such as alpha helices, beta-strands, and 
turns. Two or three adjacent secondary structures might combine into common local folds 
called "motifs" or "supersecondary" structures such as beta sheets or alpha-alpha units.
These building blocks then fold into the 3D or tertiary structure of a protein. Finally, one 
or more tertiary structures may be combined as subunits into a quaternary structure such 
as an enzyme or a virus. 


Without knowing how a protein folds into a 3D structure, you are less likely to know 
what the protein does or how it does it. Even if you know that the protein is implicated in 
a disease, knowledge of its tertiary structure is usually needed to find a possible treatment. 
Knowing the tertiary conformation of the active site of a protein (which may involve 
amino acids that are far apart in terms of the primary sequence but which are brought 
together by the folding of the protein) is critical to guide the selection of targets for new 
drugs. 


Now that the basic genetic information of a number of organisms, including humans, has 
been decoded, a primary challenge facing biologists is to learn as much as possible about 
the proteins those genes produce and how they interact. 


In fact, one of the great questions of modern biology is how the primary amino acid 
sequence of a protein determines its ultimate 3D shape. If a computational method can be 
found to reliably predict the fold of a protein from its amino acid sequence, the effect on 
biology and medicine would be profound. 


In this chapter, you'll learn the basics of PDB files and how to parse out selected 
information from them. You'll also explore interesting Perl techniques for finding and
iterating over lots of files, as well as controlling other bioinformatics programs from a 
Perl program. The exercises at the end of the chapter challenge you to extend the 
introductory material presented here to gain access to more of the PDB data. 


11.1 Overview of PDB


The main source for information about 3D structures of macromolecules (including
proteins, peptides, viruses, protein/nucleic acid complexes, nucleic acids, and 
carbohydrates) is PDB, and its format is the de facto standard for the exchange of 
structural information. Most of these structures are determined experimentally by means 
of X-ray diffraction or nuclear magnetic resonance (NMR) studies. 


PDB started in 1971 with seven proteins; it will soon grow to 20,000 structures. With the 
international effort in structural genomics increasing, the PDB is certain to continue its 
rapid growth. Within a few short years the number of known structures will approach 
100,000. 


PDB files are like GenBank records, in that they are human-readable ASCII flat files. The 
text conforms to a specific format, so computer programs may be written to extract the 
information. PDB is organized with one structure per file, unlike GenBank, which is
distributed with many records in each "library" file. 


Bioinformaticians who work extensively with PDB files report that there are serious 
problems with the consistency of the PDB format. For instance, as the field has advanced 
and the data format has evolved to meet new knowledge requirements, some of the older 
files have become out of date, and efforts are underway to address the uniformity of PDB 
data. Until these efforts are complete and a new data format is developed, inconsistencies 
in the current data format are a challenge programmers have to face. If you do a lot of 
programming with PDB files, you'll find many inconsistencies and errors in the data, 
especially in the older files. Plus, many parsing tools that work well on newer files 
perform poorly on older files. 


As you become a more experienced programmer, these and other issues the PDB faces 
become more important. For instance, as PDB evolves, the code you write to interact 
with it must also evolve; you must always maintain your code with an eye on how the 
rest of the world is changing. As links between databases become better supported, your 
code will take advantage of the new opportunities the links provide. With new standards 
of data storage becoming established, your code will have to evolve to include them. 


The PDB web site contains a wealth of information on how to download all the files. 
They are also conveniently distributed—and at no cost—on a set of CDs, which is a real 
advantage for those lacking high-throughput Internet connections. 


11.2 Files and Folders 


The PDB is distributed as files within directories. Each protein structure occupies its own 
file. PDB contains a huge amount of data, and it can be a challenge to deal with it. So in 
this section, you'll learn to deal with large numbers of files organized in directories and 
subdirectories. 


You'll frequently find a need to write programs that manipulate large numbers of files. 
For example, perhaps you keep all your sequencing runs in a directory, organized into 
subdirectories labeled by the dates of the sequencing runs and containing whatever the 
sequencer produced on those days. After a few years, you could have quite a number of 
files. 


Then, one day you discover a new sequence of DNA that seems to be implicated in cell 
division. You do a BLAST search (see Chapter 12) but find no significant hits for your 
new DNA. At that point you want to know whether you've seen this DNA before in any 
previous sequencing runs.[1] What you need to do is run a comparison subroutine on each 
of the hundreds or thousands of files in all your various sequencing run subdirectories. 
But that's going to take several days of repetitive, boring work sitting at the computer 
screen. 


[1] You may do a comparison by keeping copies of all your sequencing runs in one large BLAST 
library; building such a BLAST library can be done using the techniques shown in this section. 


You can write a program in much less time than that! Then all you have to do is sit back 
and examine the results of any significant matches your program finds. To write the 
program, however, you have to know how to manipulate all the files and folders in Perl. 
The following sections show you how to do it. 


11.2.1 Opening Directories 


A filesystem is organized in a tree structure. The metaphor is apt. Starting from anyplace 
on the tree, you can proceed up the branches and get to any leaves that stem from your 
starting place. If you start from the root of the tree, you can reach all the leaves. Similarly, 
in a filesystem, if you start at a certain directory, you can reach all the files in all the 
subdirectories that stem from your starting place, and if you start at the root (which, 
strangely enough, is also called the "top") of the filesystem, you can reach all the files. 


You've already had plenty of practice opening, reading from, writing to, and closing files. 
I will show a simple method with which you can open a folder (also called a directory) 
and get the filenames of all the files in that folder. Following that, you'll see how to get 
the names of all files from all directories and subdirectories from a certain starting point. 


Let's look at the Perlish way to list all the files in a folder, beginning with some 
pseudocode: 


open folder 

read contents of folder (files and subfolders) 

print their names 


Example 11-1 shows the actual Perl code. 

Example 11-1. Listing the contents of a folder (or directory) 

#!/usr/bin/perl
# Demonstrating how to open a folder and list its contents

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

my @files = (  );
my $folder = 'pdb';

# open the folder
unless(opendir(FOLDER, $folder)) {
    print "Cannot open folder $folder!\n";
    exit;
}

# read the contents of the folder (i.e. the files and subfolders)
@files = readdir(FOLDER);

# close the folder
closedir(FOLDER);

# print them out, one per line
print join( "\n", @files), "\n";

exit;


Since you're running this program on a folder that contains PDB files, this is what you'll 
see: 


.
..
3c
44
pdb1a4o.ent


If you want to list the files in the current directory, you can give the directory name the 
special name "." for the current directory, like so: 


my $folder = '.'; 

On Unix or Linux systems, the special files "." and ".." refer to the current directory and 
the parent directory, respectively. These aren't "really" files, at least not files you'd want 
to read; you can avoid listing them with the wonderful and amazing grep function. grep 
allows you to select elements from an array based on a test, such as a regular expression. 
Here's how to filter out the array entries "." and "..": 

@files = grep( !/^\.\.?$/, @files); 

grep selects all entries that don't match the regular expression, due to the negation 
operator written as the exclamation mark. The regular expression /^\.\.?$/ is looking 
for an entry that begins with (the beginning of a string is indicated with the ^ metacharacter) a 
period \. (escaped with a backslash since a period is a metacharacter) followed by 0 or 1 
periods \.? (the ? matches 0 or 1 of the preceding items), and nothing more (indicated 
by the $ end-of-string metacharacter). 
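
To see the filter at work without touching a real directory, here is a minimal sketch that runs the same regular expression over a made-up list of entries. The list in @entries is an assumption for illustration (it isn't output from a real readdir call), and the sketch uses the block form of grep, which behaves the same as the expression form shown above.

```perl
use strict;
use warnings;

# Hypothetical directory listing, in the form readdir might return it
my @entries = ('.', '..', '3c', '44', '.hidden', 'pdb1a4o.ent');

# Keep every entry except the special names "." and ".."
my @files = grep { !/^\.\.?$/ } @entries;

print join("\n", @files), "\n";
```

Notice that an entry such as .hidden survives the filter: the pattern rejects only names that consist of one or two periods and nothing else.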


In fact, this is so often used when reading a directory that it's usually combined into one 
step: 


@files = grep (!/^\.\.?$/, readdir(FOLDER)); 


Okay, now all the files are listed. But wait: what if some of these files aren't files at all 
but are subfolders? You can use the handy file test operators to test each filename and 
then even open each subfolder and list the files in them. First, some pseudocode: 


open folder 

for each item in the folder 

    if it's a file 
        print its name 

    else if it's a folder 
        open the folder 
        print the names of the contents of the folder 


Example 11-2 shows the program. 


Example 11-2. List contents of a folder and its subfolders 


#!/usr/bin/perl
# Demonstrating how to open a folder and list its contents
# --distinguishing between files and subfolders, which
#   are themselves listed

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

my @files = (  );
my $folder = 'pdb';

# Open the folder
unless(opendir(FOLDER, $folder)) {
    print "Cannot open folder $folder!\n";
    exit;
}

# Read the folder, ignoring special entries "." and ".."
@files = grep (!/^\.\.?$/, readdir(FOLDER));

closedir(FOLDER);

# If file, print its name
# If folder, print its name and contents
#
# Notice that we need to prepend the folder name!
foreach my $file (@files) {

    # If the folder entry is a regular file
    if (-f "$folder/$file") {
        print "$folder/$file\n";

    # If the folder entry is a subfolder
    }elsif( -d "$folder/$file") {

        my $folder = "$folder/$file";

        # open the subfolder and list its contents
        unless(opendir(FOLDER, "$folder")) {
            print "Cannot open folder $folder!\n";
            exit;
        }

        my @files = grep (!/^\.\.?$/, readdir(FOLDER));

        closedir(FOLDER);

        foreach my $file (@files) {
            print "$folder/$file\n";
        }
    }
}

exit;

Here's the output of Example 11-2: 

pdb/3c/pdb43c9.ent
pdb/3c/pdb43ca.ent
pdb/44/pdb144d.ent
pdb/44/pdb144l.ent
pdb/44/pdb244d.ent
pdb/44/pdb244l.ent
pdb/44/pdb344d.ent
pdb/44/pdb444d.ent
pdb/pdb1a4o.ent


Notice how variable names such as $file and @files have been reused in this code, 
using lexical scoping in the inner blocks with my. If the overall structure of the program 
weren't so short and simple, this could get really hard to read. When the program says 
$file, does it mean this $file or that $file? This code is an example of how to get 
into trouble. It works, but it's hard to read, despite its brevity. 


In fact, there's a deeper problem with Example 11-2. It's not well designed. By 
extending Example 11-1, it can now list subdirectories. But what if there are further 
levels of subdirectories? 


11.2.2 Recursion 


If you have a subroutine that lists the contents of directories and recursively calls itself to 
list the contents of any subdirectories it finds, you can call it on the top-level directory, 
and it eventually lists all the files. 


Let's write another program that does just that. A recursive subroutine is defined 
simply as a subroutine that calls itself. Here is the pseudocode and the code (Example 
11-3) followed by a discussion of how recursion works: 

subroutine list_recursively 

    open folder 

    for each item in the folder 

        if it's a file 
            print its name 

        else if it's a folder 
            list_recursively 


Example 11-3. A recursive subroutine to list a filesystem 
#!/usr/bin/perl
# Demonstrate a recursive subroutine to list a subtree of a filesystem

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

list_recursively("pdb");

exit;

################################################################################
# Subroutine
################################################################################

# list_recursively
#
#   list the contents of a directory,
#   recursively listing the contents of any subdirectories

sub list_recursively {

    my($directory) = @_;

    my @files = (  );

    # Open the directory
    unless(opendir(DIRECTORY, $directory)) {
        print "Cannot open directory $directory!\n";
        exit;
    }

    # Read the directory, ignoring special entries "." and ".."
    @files = grep (!/^\.\.?$/, readdir(DIRECTORY));

    closedir(DIRECTORY);

    # If file, print its name
    # If directory, recursively print its contents

    # Notice that we need to prepend the directory name!
    foreach my $file (@files) {

        # If the directory entry is a regular file
        if (-f "$directory/$file") {

            print "$directory/$file\n";

        # If the directory entry is a subdirectory
        }elsif( -d "$directory/$file") {

            # Here is the recursive call to this subroutine
            list_recursively("$directory/$file");
        }
    }
}

Here's the output of Example 11-3 (notice that it's the same as the output of 
Example 11-2): 

pdb/3c/pdb43c9.ent
pdb/3c/pdb43ca.ent
pdb/44/pdb144d.ent
pdb/44/pdb144l.ent
pdb/44/pdb244d.ent
pdb/44/pdb244l.ent
pdb/44/pdb344d.ent
pdb/44/pdb444d.ent
pdb/pdb1a4o.ent

Look over the code for Example 11-3 and compare it to Example 11-2. As you can 
see, the programs are largely identical. Example 11-2 is all one main program; 
Example 11-3 has almost identical code but has packaged it up as a subroutine that is 
called by a short main program. The main program of Example 11-3 simply calls a 
recursive function, giving it a directory name (for a directory that exists on my computer; 
you may need to change the directory name when you attempt to run this program on 
your own computer). Here is the call: 

list_recursively("pdb");

I don't know if you feel let down, but I do. This looks just like any other subroutine call. 
Clearly, the recursion must be defined within the subroutine. It's not until the very end of 
the list_recursively subroutine, where the program finds (using the -d file test 
operator) that one of the contents of the directory that it's listing is itself a directory, that 
there's a significant difference in the code as compared with Example 11-2. At that 
point, Example 11-2 has code to once again look for regular files or for directories. 
But this subroutine in Example 11-3 simply calls a subroutine, which happens to be 
itself, namely, list_recursively: 

list_recursively("$directory/$file");

That's recursion. 


As you've seen here, there are times when the data (for instance, the hierarchical 
structure of a filesystem) is well matched by the capabilities of recursive programs. 
Using recursion can result in clean, short, easy-to-understand programs, although it can 
be slow, due to all the subroutine calls it can create. When the recursive call is the very 
last action a subroutine performs, it's a special type of recursion called tail recursion, 
and the good news about tail recursion is that many compilers can optimize the code to 
make it run much faster. Strictly speaking, the call in list_recursively sits inside a 
foreach loop that carries on with the remaining entries after the call returns, so it's 
only loosely a tail call. (Although Perl doesn't yet optimize it, current plans 
for Perl 6 include support for optimizing tail recursion.) 
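
Printing from inside the subroutine is fine for a listing, but you'll often want the filenames back as data. Here is a minimal sketch of a variation that returns the list instead. The name collect_files is hypothetical (it's not a subroutine from this book), and it makes two design choices that differ from Example 11-3: it uses a lexical directory handle, so each call has its own private handle, and it sorts the entries so the order is predictable. So that it runs anywhere, it first builds a small throwaway directory tree with the standard File::Temp and File::Path modules.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# collect_files is a hypothetical name, used here for illustration only.
# It returns the regular files under $directory instead of printing them.
sub collect_files {
    my ($directory) = @_;
    my @found;

    # A lexical handle ($dh) is private to each call of the subroutine
    opendir(my $dh, $directory) or die "Cannot open directory $directory: $!";
    my @entries = sort grep { !/^\.\.?$/ } readdir($dh);
    closedir($dh);

    foreach my $entry (@entries) {
        my $path = "$directory/$entry";
        if (-f $path) {
            push @found, $path;
        } elsif (-d $path) {
            push @found, collect_files($path);   # the recursive call
        }
    }
    return @found;
}

# Build a small throwaway tree so the sketch runs anywhere
my $root = tempdir(CLEANUP => 1);
make_path("$root/44", "$root/3c");
foreach my $name ('pdb1a4o.ent', '44/pdb144d.ent', '3c/pdb43c9.ent') {
    open(my $fh, '>', "$root/$name") or die "Cannot create $root/$name: $!";
    close($fh);
}

my @files = collect_files($root);
print "$_\n" foreach @files;
```

Because the subroutine returns its results, the caller decides what to do with them: print them, count them, or hand them to a comparison subroutine.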


11.2.3 Processing Many Files 


Perl has modules for a variety of tasks. Some come standard with Perl; more can be 
installed after obtaining them from CPAN or elsewhere: http://www.CPAN.org/. 
Example 11-3 in the previous section showed how to locate all files and directories 
under a given directory. There's a module that is standard in any recent version of Perl 
called File::Find. You can find it in your manual pages: on Unix or Linux, for instance, 
you issue the command perldoc File::Find. This module makes it easy, and 
efficient, to process all files under a given directory, performing whatever operations 
you specify. 

Example 11-4 uses File::Find. Consult the documentation for more examples of this 
useful module. The example shows the same functionality as Example 11-3 but now 
uses File::Find. It simply lists the files. Notice how much less code you 
have to write if you find a good module, ready to use! 


Example 11-4. Demonstrate File::Find 


#!/usr/bin/perl
# Demonstrate File::Find

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

use File::Find;

find ( \&my_sub, ('pdb') );

sub my_sub {
    -f and (print $File::Find::name, "\n");
}

exit;


Notice that a reference to the my_sub subroutine is passed to find by prefacing the 
name with the backslash character. You also need to preface the name with the ampersand 
character, as mentioned in Chapter 6. 

The call to find can also be done like this: 

find sub { -f and (print $File::Find::name, "\n") }, ('pdb');

This puts an anonymous subroutine in place of the reference to the my_sub subroutine, 
and it's a convenience for these types of short subroutines. 
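
If subroutine references are new to you, this short sketch shows the two forms side by side, away from File::Find. The names shout and apply_to are hypothetical and exist only for this illustration; apply_to plays the role that find plays, calling whatever code it is handed.

```perl
use strict;
use warnings;

# A named subroutine (hypothetical, for illustration only)
sub shout {
    my ($text) = @_;
    return uc $text;
}

# apply_to calls whatever code reference it is handed
sub apply_to {
    my ($code_ref, $text) = @_;
    return $code_ref->($text);
}

# A reference to a named subroutine: backslash plus ampersand
my $upper = apply_to(\&shout, 'Leu');

# An anonymous subroutine written in place
my $lower = apply_to(sub { my ($t) = @_; return lc $t }, 'Leu');

print "$upper $lower\n";
```

Either way, the receiving subroutine gets the same kind of value, a code reference, and it neither knows nor cares whether the code has a name.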

Here's the output: 

pdb/pdb1a4o.ent
pdb/44/pdb144d.ent
pdb/44/pdb144l.ent
pdb/44/pdb244d.ent
pdb/44/pdb244l.ent
pdb/44/pdb344d.ent
pdb/44/pdb444d.ent
pdb/3c/pdb43c9.ent
pdb/3c/pdb43ca.ent

As a final example of processing files with Perl, here's the same functionality as the 
preceding programs, with a one-line program, issued at the command line: 

perl -e 'use File::Find;find sub{-f and (print $File::Find::name,"\n")}, ("pdb")'








Pretty cool, for those who admire terseness, although it doesn't really eschew obfuscation. 
Also note that for those on Unix systems, ls -R pdb and find pdb -print do much the 
same thing with even less typing. 


The reason for using a subroutine that you define is that it enables you to perform any 
arbitrary tests on the files you find and then take any actions with those files. It's another 
case of modularization: the File::Find module makes it easy to recurse over all the files 
and directories in a file structure and lets you do as you wish with the files and directories 
you find. 
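

To tie this back to the sequencing-run scenario from the start of the section, here is a minimal sketch of a "wanted" subroutine that tests each file and acts on it. Everything specific in it is an assumption for illustration: the directory names, the .seq file extension, the file contents, and the query sequence are all made up, and a plain substring search stands in for whatever comparison subroutine you would really use. The sketch creates its own throwaway directory tree so that it can run anywhere.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Set up a throwaway tree standing in for dated sequencing runs
my $root = tempdir(CLEANUP => 1);
my %runs = (
    '2001-03-01/run1.seq'  => "ACGTACGTTTGACA\n",
    '2001-03-02/run2.seq'  => "GGGCCCATTAGGCATTA\n",
    '2001-03-02/notes.txt' => "TTAGGC appears in these notes\n",
);
foreach my $name (keys %runs) {
    my ($subdir) = $name =~ m{^([^/]+)/};
    make_path("$root/$subdir");
    open(my $fh, '>', "$root/$name") or die "Cannot write $root/$name: $!";
    print $fh $runs{$name};
    close($fh);
}

my $query = 'TTAGGC';      # hypothetical sequence of interest
my @hits;

# The "wanted" subroutine: test each entry, then act on those that pass
sub wanted {
    return unless -f $_ and /\.seq$/;      # only regular .seq files
    open(my $fh, '<', $_) or return;       # find has changed into this directory
    my $text = do { local $/; <$fh> };     # read the whole file at once
    close($fh);
    $text =~ s/\s//g;                      # ignore line breaks
    push @hits, $File::Find::name if index($text, $query) >= 0;
}

find(\&wanted, $root);
print "$_\n" foreach sort @hits;
```

Inside the subroutine, $_ holds the bare filename (File::Find changes into each directory as it goes, by default), while $File::Find::name holds the full path, which is why the file is opened with one and reported with the other.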


11.3 PDB Files 


Here's a section of an actual PDB file: 


HEADER    SUGAR BINDING PROTEIN                   03-MAR-99   1C1F
TITLE     LIGAND-FREE CONGERIN I
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: CONGERIN I;
COMPND   3 CHAIN: A;
COMPND   4 FRAGMENT: CARBOHYDRATE-RECOGNITION-DOMAIN;
COMPND   5 BIOLOGICAL_UNIT: HOMODIMER
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: CONGER MYRIASTER;
SOURCE   3 ORGANISM_COMMON: CONGER EEL;
SOURCE   4 TISSUE: SKIN MUCUS;
SOURCE   5 SECRETION: NON-CLASSICAL
KEYWDS    GALECTIN, LECTIN, BETA-GALACTOSE-BINDING, SUGAR BINDING
KEYWDS   2 PROTEIN
EXPDTA    X-RAY DIFFRACTION
AUTHOR    T.SHIRAI,C.MITSUYAMA,Y.NIWA,Y.MATSUI,H.HOTTA,T.YAMANE,
AUTHOR   2 H.KAMIYA,C.ISHII,T.OGAWA,K.MURAMOTO
REVDAT   2   14-OCT-99 1C1F    1       SEQADV HEADER
REVDAT   1   08-OCT-99 1C1F    0
JRNL        AUTH   T.SHIRAI,C.MITSUYAMA,Y.NIWA,Y.MATSUI,H.HOTTA,
JRNL        AUTH 2 T.YAMANE,H.KAMIYA,C.ISHII,T.OGAWA,K.MURAMOTO
JRNL        TITL   HIGH-RESOLUTION STRUCTURE OF CONGER EEL GALECTIN,
JRNL        TITL 2 CONGERIN I, IN LACTOSE- LIGANDED AND LIGAND-FREE
JRNL        TITL 3 FORMS: EMERGENCE OF A NEW STRUCTURE CLASS BY
JRNL        TITL 4 ACCELERATED EVOLUTION
JRNL        REF    STRUCTURE (LONDON)            V.   7  1223 1999
JRNL        REFN   ASTM STRUE6  UK ISSN 0969-2126                 2005
REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 1.6  ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM     : X-PLOR 3.1
REMARK   3   AUTHORS     : BRUNGER
REMARK   3
REMARK   3  DATA USED IN REFINEMENT.
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 1.60
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 8.00
REMARK   3   DATA CUTOFF            (SIGMA(F)) : 3.000
REMARK   3   DATA CUTOFF HIGH         (ABS(F)) : NULL
REMARK   3   DATA CUTOFF LOW          (ABS(F)) : NULL
REMARK   3   COMPLETENESS (WORKING+TEST)   (%) : 85.60
REMARK   3   NUMBER OF REFLECTIONS             : 17099
REMARK   3
REMARK   3
REMARK   3  FIT TO DATA USED IN REFINEMENT.
REMARK   3   CROSS-VALIDATION METHOD          : THROUGHOUT
REMARK   3   FREE R VALUE TEST SET SELECTION  : RANDOM
REMARK   3   R VALUE            (WORKING SET) : 0.207
REMARK   3   FREE R VALUE                     : 0.247
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.000
REMARK   3   FREE R VALUE TEST SET COUNT      : 855
REMARK   3   ESTIMATED ERROR OF FREE R VALUE  : NULL
REMARK   3

(file truncated here) 

REMARK   4
REMARK   4 1C1F COMPLIES WITH FORMAT V. 2.3, 09-JULY-1998
REMARK   7
REMARK   7 >>> WARNING: CHECK REMARK 999 CAREFULLY
REMARK   8
REMARK   8 SIDE-CHAINS OF SER123 AND LEU124 ARE MODELED AS ALTERNATIVE
REMARK   8 CONFORMERS.
REMARK   9
REMARK   9 SER1 IS ACETYLATED.
REMARK  10
REMARK  10 TER
REMARK  10  SER: THE N-TERMINAL RESIDUE WAS NOT OBSERVED
REMARK 100
REMARK 100 THIS ENTRY HAS BEEN PROCESSED BY RCSB ON 07-MAR-1999.
REMARK 100 THE RCSB ID CODE IS RCSB000566.
REMARK 200
REMARK 200 EXPERIMENTAL DETAILS
REMARK 200  EXPERIMENT TYPE                : X-RAY DIFFRACTION
REMARK 200  DATE OF DATA COLLECTION        : NULL
REMARK 200  TEMPERATURE           (KELVIN) : 291.0
REMARK 200  PH                             : 9.00
REMARK 200  NUMBER OF CRYSTALS USED        : 1
REMARK 200
REMARK 200  SYNCHROTRON              (Y/N) : Y
REMARK 200  RADIATION SOURCE               : PHOTON FACTORY
REMARK 200  BEAMLINE                       : BL6A
REMARK 200  X-RAY GENERATOR MODEL          : NULL
REMARK 200  MONOCHROMATIC OR LAUE    (M/L) : M
REMARK 200  WAVELENGTH OR RANGE        (A) : 1.00
REMARK 200  MONOCHROMATOR                  : NULL
REMARK 200  OPTICS                         : NULL
REMARK 200

(file truncated here) 

REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: COVALENT BOND ANGLES
REMARK 500
REMARK 500 THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES
REMARK 500 HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE
REMARK 500 THAN 4*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
REMARK 500 IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
REMARK 500
REMARK 500 STANDARD TABLE:
REMARK 500 FORMAT: (10X,I3,1X,A3,1X,A1,I4,A1,3(1X,A4,2X),12X,F5.1)
REMARK 500
REMARK 500 EXPECTED VALUES: ENGH AND HUBER, 1991
REMARK 500
REMARK 500  M RES CSSEQI ATM1   ATM2   ATM3
REMARK 500    HIS A  44   N   -  CA  -  C   ANGL. DEV. =-10.3 DEGREES
REMARK 500    LEU A 132   CA  -  CB  -  CG  ANGL. DEV. = 12.5 DEGREES
REMARK 700
REMARK 700 SHEET
REMARK 700 DETERMINATION METHOD: AUTHOR-DETERMINED
REMARK 999
REMARK 999 SEQUENCE
REMARK 999 LEU A 135 IS NOT PRESENT IN SEQUENCE DATABASE
REMARK 999

DBREF  1C1F A    1   136  SWS    P26788   LEG_CONMY        1    135
SEQADV 1C1F LEU A  135  SWS  P26788              SEE REMARK 999
SEQRES   1 A  136  SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL
SEQRES   2 A  136  GLY LYS PHE LEU THR VAL GLY GLY PHE ILE ASN ASN SER
SEQRES   3 A  136  PRO GLN ARG PHE SER VAL ASN VAL GLY GLU SER MET ASN
SEQRES   4 A  136  SER LEU SER LEU HIS LEU ASP HIS ARG PHE ASN TYR GLY
SEQRES   5 A  136  ALA ASP GLN ASN THR ILE VAL MET ASN SER THR LEU LYS
SEQRES   6 A  136  GLY ASP ASN GLY TRP GLU THR GLU GLN ARG SER THR ASN
SEQRES   7 A  136  PHE THR LEU SER ALA GLY GLN TYR PHE GLU ILE THR LEU
SEQRES   8 A  136  SER TYR ASP ILE ASN LYS PHE TYR ILE ASP ILE LEU ASP
SEQRES   9 A  136  GLY PRO ASN LEU GLU PHE PRO ASN ARG TYR SER LYS GLU
SEQRES  10 A  136  PHE LEU PRO PHE LEU SER LEU ALA GLY ASP ALA ARG LEU
SEQRES  11 A  136  THR LEU VAL LYS LEU GLU
FORMUL   2  HOH   *81(H2 O1)
HELIX    1   1 GLY A   66  ASN A   68  5                                   3
SHEET    1  S1 1 GLY A   3  VAL A   6  0
SHEET    1  S2 1 PHE A 121  GLY A 126  0
SHEET    1  S3 1 ARG A  29  GLY A  35  0
SHEET    1  S4 1 LEU A  41  ASN A  50  0
SHEET    1  S5 1 GLN A  55  THR A  63  0
SHEET    1  S6 1 GLN A  74  SER A  76  0
SHEET    1  F1 1 ALA A 128  GLU A 136  0
SHEET    1  F2 1 PHE A  16  ILE A  23  0
SHEET    1  F3 1 TYR A  86  TYR A  93  0
SHEET    1  F4 1 LYS A  97  ILE A 102  0
SHEET    1  F5 1 ASN A 107  PRO A 111  0
CRYST1   94.340   36.920   40.540  90.00  90.00  90.00 P 21 21 2     4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.010600  0.000000  0.000000        0.00000
SCALE2      0.000000  0.027085  0.000000        0.00000
SCALE3      0.000000  0.000000  0.024667        0.00000

ATOM      1  N   GLY A   2       1.888  -8.251  -2.511  1.00 36.63           N
(file truncated here)
ATOM      5  N   GLY A   3       2.302  -6.984   0.693  1.00 24.67           N
(file truncated here)
ATOM      8  O   GLY A   3      -0.187  -6.142   1.010  1.00 12.47           O
(file truncated here)
ATOM   1078  CG  GLU A 136      -0.873   9.368  16.046  1.00 38.96           C
(file truncated here)
END


PDB files are long, mostly due to the need for information about each atom in the 
molecule; this relatively short one, when complete, is extensive: 28 formatted pages. I 
cut it here to a little over three pages, showing just enough of the principal sections to 
give you the overall idea. 


The PDB web site has the basic documents you need to read and program with PDB files. 
The Protein Data Bank Contents Guide 
(http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html) 
is the best reference, and there are also FAQs and additional documents available. 


In the following sections, you'll extract information from these files. Since the 
information in these files describes the 3D structure of macromolecules, the files are 
frequently used by graphical programs that display a spatial representation of the 
molecules. The scope of this book does not include graphics; however, you will see how 
to get spatial coordinates out of the files. The largest part of a PDB file is the set of 
ATOM record type lines containing the coordinates of the atoms. Because of this level of detail, 
PDB files are typically longer than GenBank records. (Note the inconsistent 
terminology: a unit of PDB is the file, which contains one structure; a unit of GenBank 
is the record, which contains one entry.) 
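

As a small preview of getting coordinates out of an ATOM line, here is a minimal sketch that works by column position. It assumes the fixed columns of the PDB format as I understand them from the Contents Guide (x in columns 31-38, y in 39-46, z in 47-54, residue name in 18-20); check them against the guide before relying on them. The record is modeled on the first ATOM line of the listing, and it is built with sprintf so that every field is certain to sit in its column.

```perl
use strict;
use warnings;

# Build an ATOM record modeled on the first one in the listing; sprintf
# guarantees that every field lands in its fixed column.
my $line = sprintf(
    "%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f          %2s",
    'ATOM', 1, ' N', '', 'GLY', 'A', 2, '', 1.888, -8.251, -2.511, 1.00, 36.63, 'N');

# substr offsets count from zero, so column 31 is offset 30
my $residue = substr($line, 17, 3);                               # columns 18-20
my ($x, $y, $z) = map { substr($line, $_, 8) + 0 } (30, 38, 46);  # 8 columns each

print "$line\n";
print "Residue $residue at x=$x y=$y z=$z\n";
```

Adding 0 to each extracted field turns the space-padded text into a number. Working by column rather than splitting on whitespace matters here, because adjacent fields in a PDB line are allowed to touch.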


11.3.1 PDB File Format 


Let's take a look at a PDB file and the documentation that tells how the information is 
formatted in a PDB file. Based on that information, you'll parse the file to extract 
information of interest. 


PDB files are composed of lines of 80 columns that begin with one of several predefined 
record names and end with a newline. ("Column" means position on a line: the first 
character is in the first column, and so forth.) Blank columns are padded with spaces. A 
record type is one or more lines with the same record name. Different record types have 
different types of fields defined within the lines. They are also grouped according to 
function. 
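

The definition of a record type suggests a natural first step for any PDB program: collect the lines by record name. Here is a minimal sketch that does so for a handful of lines taken from the example file. The hash name %record_types is my own choice for illustration, and real code would read the lines from a file rather than from a list.

```perl
use strict;
use warnings;

# A few lines in PDB style; the record name occupies columns 1-6
my @lines = (
    'HEADER    SUGAR BINDING PROTEIN                   03-MAR-99   1C1F',
    'TITLE     LIGAND-FREE CONGERIN I',
    'COMPND    MOL_ID: 1;',
    'COMPND   2 MOLECULE: CONGERIN I;',
    'REMARK   2 RESOLUTION. 1.6  ANGSTROMS.',
    'END',
);

my %record_types;    # record name => reference to a list of lines

foreach my $line (@lines) {
    # Take the first six columns and drop the padding spaces
    my $name = substr($line, 0, 6);
    $name =~ s/\s+$//;
    push @{ $record_types{$name} }, $line;
}

foreach my $name (sort keys %record_types) {
    printf "%-6s %d line(s)\n", $name, scalar @{ $record_types{$name} };
}
```

With the lines grouped this way, a program can go straight to the record type it cares about, such as SEQRES or ATOM, without rescanning the whole file.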


The SEQRES record type is one of four record types in the Primary Structure Section, 
which presents the primary structure of the peptide or nucleotide sequence: 


DBREF 
Reference to the entry in the sequence database(s) 
SEQADV 
Identification of conflicts between PDB and the named sequence database 
SEQRES 
Primary sequence of backbone residues 
MODRES 
Identification of modifications to standard residues 


The DBREF and SEQADV record types in the example PDB entry from the previous 
section give reference information and details on conflicts between the PDB and the 
original database. (The example doesn't include a MODRES record type.) Here are those 
record types from the entry: 


DBREF  1C1F A    1   136  SWS    P26788   LEG_CONMY        1    135
SEQADV 1C1F LEU A  135  SWS  P26788              SEE REMARK 999


Briefly, the DBREF line states there's a PDB file called 1C1F (from a file named 
pdb1c1f.ent), the residues in chain A are numbered from 1 to 136 in PDB, the 
corresponding entry is in the Swiss-Prot (SWS) database, the ID number P26788 and the 
name LEG_CONMY are assigned in that database (in many databases these are identical), 
and the residues are numbered 1 to 135 there. The discrepancy in the numbering between 
PDB and the original database is explained in the SEQADV record type, which refers you to a 
REMARK 999 line (not shown here) where you discover that the PDB entry disagrees 
with the Swiss-Prot sequence concerning a leucine at position 135 (perhaps two different 
groups determined the structure, and they disagree at this point). 


[2] The cross-referencing to different databases is problematic in older PDB files: it may be 
missing, or buried somewhere in a REMARK 999 line. 


You can see that to parse the information in those two lines by a program requires several 
steps, such as following links to other lines in the PDB entry that further explain 
discrepancies and identifying other databases. 
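

The first of those steps, pulling the fields out of the DBREF line, can be sketched as follows. The column positions are the ones I understand the Contents Guide to give for DBREF, so treat them as an assumption and check them against the guide. The helper named field and the hash keys are hypothetical names chosen for this illustration, and the small lookup hash holds three of the codes from Table 11-1. Following the links to REMARK 999 is left out.

```perl
use strict;
use warnings;

# Build the DBREF line with sprintf so each field lands in its column
my $line = sprintf("%-6s %4s %1s %4d%1s %4d%1s %-6s %-8s %-12s %5d%1s %5d%1s",
    'DBREF', '1C1F', 'A', 1, '', 136, '', 'SWS', 'P26788', 'LEG_CONMY',
    1, '', 135, '');

# field (a hypothetical helper): text in columns $from..$to, counted
# from 1, with the padding spaces removed
sub field {
    my ($text, $from, $to) = @_;
    my $value = substr($text, $from - 1, $to - $from + 1);
    $value =~ s/^\s+|\s+$//g;
    return $value;
}

my %dbref = (
    id_code      => field($line,  8, 11),
    chain        => field($line, 13, 13),
    seq_begin    => field($line, 15, 18),   # numbering in the PDB entry
    seq_end      => field($line, 21, 24),
    database     => field($line, 27, 32),
    db_accession => field($line, 34, 41),
    db_id_code   => field($line, 43, 54),
    db_seq_begin => field($line, 56, 60),   # numbering in the database
    db_seq_end   => field($line, 63, 67),
);

# A few of the database codes from Table 11-1
my %database_name = (
    SWS => 'SWISS-PROT',
    GB  => 'GenBank',
    PIR => 'Protein Identification Resource',
);

print "PDB entry $dbref{id_code}, chain $dbref{chain}: ",
      "residues $dbref{seq_begin}-$dbref{seq_end}\n";
print "$database_name{ $dbref{database} } entry $dbref{db_accession} ",
      "($dbref{db_id_code}): residues $dbref{db_seq_begin}-$dbref{db_seq_end}\n";
```

A program could compare the two residue ranges and, when they differ, go looking for SEQADV lines to find out why.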


Links between databases are important in bioinformatics. Table 11-1 displays the 

databases that are referred to in PDB files. As you already know, there are many 

biological databases; those shown here have a good deal of protein or structural data. 
Table 11-1. Databases referenced in PDB files 


Database                                 PDB code 
BioMagResBank                            BMRB 
BLOCKS                                   BLOCKS 
European Molecular Biology Laboratory    EMBL 
GenBank                                  GB 
Genome Data Base                         GDB 
Nucleic Acid Database                    NDB 
PROSITE                                  PROSIT 
Protein Data Bank                        PDB 
Protein Identification Resource          PIR 
SWISS-PROT                               SWS 
TREMBL                                   TREMBL 


11.3.2 SEQRES 


For starters, let's try a fairly easy task in Perl: extracting the amino acid sequence data. To 
extract the amino acid primary sequence information, you need to parse the record type 
SEQRES. Here is a SEQRES line from the PDB file listed earlier: 


SEQRES   1 A  136  SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL


The following excerpt shows the SEQRES record type as defined in the Protein Data Bank 
Contents Guide. This section on SEQRES, which is a fairly simple record type, is shown 
in its entirety to help familiarize you with this kind of documentation. 


SEQRES 

Overview 

SEQRES records contain the amino acid or nucleic acid sequence of residues in 
each chain of the macromolecule that was studied. 

Record Format 

COLUMNS     DATA TYPE       FIELD       DEFINITION 
 1 -  6     Record name     "SEQRES" 
 9 - 10     Integer         serNum      Serial number of the SEQRES record 
                                        for the current chain. Starts at 1 
                                        and increments by one each line. 
                                        Reset to 1 for each chain. 
12          Character       chainID     Chain identifier. This may be any 
                                        single legal character, including a 
                                        blank which is used if there is 
                                        only one chain. 
14 - 17     Integer         numRes      Number of residues in the chain. 
                                        This value is repeated on every 
                                        record. 
20 - 22     Residue name    resName     Residue name. 
24 - 26     Residue name    resName     Residue name. 
28 - 30     Residue name    resName     Residue name. 
32 - 34     Residue name    resName     Residue name. 
36 - 38     Residue name    resName     Residue name. 
40 - 42     Residue name    resName     Residue name. 
44 - 46     Residue name    resName     Residue name. 
48 - 50     Residue name    resName     Residue name. 
52 - 54     Residue name    resName     Residue name. 
56 - 58     Residue name    resName     Residue name. 
60 - 62     Residue name    resName     Residue name. 
64 - 66     Residue name    resName     Residue name. 
68 - 70     Residue name    resName     Residue name. 

Details 





* PDB entries use the three-letter abbreviation for amino acid names and the 
  one-letter code for nucleic acids. 

* In the case of non-standard groups, a hetID of up to three (3) alphanumeric 
  characters is used. Common HET names appear in the HET dictionary. 

* Each covalently contiguous sequence of residues (connected via the "backbone" 
  atoms) is represented as an individual chain. 

* Heterogens which are integrated into the backbone of the chain are listed as 
  being part of the chain and are included in the SEQRES records for that chain. 

* Each set of SEQRES records and each HET group is assigned a component number. 
  The component number is assigned serially beginning with 1 for the first set 
  of SEQRES records. This number is given explicitly in the FORMUL record, but 
  only implicitly in the SEQRES record. 

* The SEQRES records must list residues present in the molecule studied, even 
  if the coordinates are not present. 

* C- and N-terminus residues for which no coordinates are provided due to 
  disorder must be listed on SEQRES. 

* All occurrences of standard amino or nucleic acid residues (ATOM records) 
  must be listed on a SEQRES record. This implies that a numRes of 1 is valid. 

* No distinction is made between ribo- and deoxyribonucleotides in the SEQRES 
  records. These residues are identified with the same residue name (i.e., A, 
  C, G, T, U, I). 

* If the entire residue sequence is unknown, the serNum in column 10 is "0", 
  the number of residues thought to comprise the molecule is entered as numRes 
  in columns 14 - 17, and resName in columns 20 - 22 is "UNK". 

* In case of microheterogeneity, only one of the sequences is presented. A 
  REMARK is generated to explain this and a SEQADV is also generated. 

Verification/Validation/Value Authority Control 





The residues presented on the SEQRES records must agree with those
found in the ATOM records.

The SEQRES records are checked by PDB using the sequence databases and
information provided by the depositor.

SEQRES is compared to the ATOM records during processing, and both are
checked against the sequence database. All discrepancies are either
resolved or annotated in the entry.





Relationships to Other Record Types 





The residues presented on the SEQRES records must agree with those
found in the ATOM records. DBREF refers to the corresponding entry in
the sequence databases. SEQADV lists all discrepancies between the
entry's sequence for which there are coordinates and that referenced in
the sequence database. MODRES describes modifications to a standard
residue.








Example 


         1         2         3         4         5         6         7
1234567890123456789012345678901234567890123456789012345678901234567890
SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU
SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN
SEQRES   1 B   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU
SEQRES   2 B   30  ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR
SEQRES   3 B   30  THR PRO LYS ALA
SEQRES   1 C   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU
SEQRES   2 C   21  TYR GLN LEU GLU ASN TYR CYS ASN
SEQRES   1 D   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU
SEQRES   2 D   30  ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR
SEQRES   3 D   30  THR PRO LYS ALA
































Known Problems 


Polysaccharides do not lend themselves to being represented in SEQRES.

There is no mechanism provided to describe sequence runs when the exact
ordering of the sequence is not known.

For cyclic peptides, PDB arbitrarily assigns a residue as the
N-terminus.

For microheterogeneity only one of the possible residues in a given
position is provided in SEQRES.

No distinction is made between ribo- and deoxyribonucleotides in the
SEQRES records. These residues are identified with the same residue
name (i.e., A, C, G, T, U, I).














The structure of the line containing the SEQRES record type is fairly straightforward, 
with fields assigned to specific locations or columns in the line. You'll see later how to 
use these locations to parse the information. Note that the documentation includes many 
details that arise when handling such complex experimental data. 


Apart from the fairly standard problem of accumulating the sequence, there is the added 
complication of multiple strands. By reading the documentation just shown, you'll see 
that the SEQRES identifier is followed by a number representing the line number for that 
chain, and the chain is given in the next field (although in older records it was optional 
and may be blank). Following those fields comes a number that gives the total number of 
residues in the chain. Finally, after that, come residues represented as three-letter codes. 
What is needed, and what can be ignored to meet our programming goals? 
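
To make the column layout concrete, here is a minimal sketch that pulls the fields out of a single SEQRES line with substr. The subroutine name seqres_fields is hypothetical (it is not part of the book's module), and the column positions are taken from the format description quoted above, with serNum assumed to be right-justified in columns 9-10.

```perl
use strict;
use warnings;

# Split one SEQRES line into its fixed-column fields.
# Columns (1-based): 9-10 serNum, 12 chainID, 14-17 numRes,
# 20 onward the three-letter residue names.
# seqres_fields is a hypothetical helper name used for illustration.
sub seqres_fields {
    my ($line) = @_;

    my $sernum   = substr($line,  8, 2);
    my $chain    = substr($line, 11, 1);
    my $numres   = substr($line, 13, 4);
    my $residues = length($line) > 19 ? substr($line, 19) : '';

    # Numeric fields are right-justified, so strip the padding
    for ($sernum, $numres) { s/^\s+//; s/\s+$//; }

    # split ' ' skips leading whitespace and splits on runs of whitespace
    my @residues = split ' ', $residues;

    return ($sernum, $chain, $numres, @residues);
}

my $line = 'SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN';
my ($sernum, $chain, $numres, @residues) = seqres_fields($line);
print "line $sernum of chain $chain ($numres residues total): @residues\n";
```

For the goal of collecting each chain's sequence, only the chain identifier and the residue names are really needed; the line number and residue count are useful mainly as consistency checks.
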


11.4 Parsing PDB Files 


First, Example 11-5 shows the main program and three subroutines that will be 
discussed in this section. 


Example 11-5. Extract sequence chains from PDB file 


#!/usr/bin/perl
# Extract sequence chains from PDB file

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Read in PDB file: Warning - some files are very large!
my @file = get_file_data('pdb/c1/pdb1c1f.ent');

# Parse the record types of the PDB file
my %recordtypes = parsePDBrecordtypes(@file);

# Extract the amino acid sequences of all chains in the protein
my @chains = extractSEQRES( $recordtypes{'SEQRES'} );

# Translate the 3-character codes to 1-character codes, and print
foreach my $chain (@chains) {
    print "****chain $chain **** \n";
    print "$chain\n";
    print iub3to1($chain), "\n";
}

exit;


################################################################################
# Subroutines for Example 11-5
################################################################################

# parsePDBrecordtypes
#
#--given an array of a PDB file, return a hash with
#    keys   = record type names
#    values = scalar containing lines for that record type

sub parsePDBrecordtypes {

    my @file = @_;

    use strict;
    use warnings;

    my %recordtypes = (  );

    foreach my $line (@file) {

        # Get the record type name which begins at the
        # start of the line and ends at the first space
        my($recordtype) = ($line =~ /^(\S+)/);

        # .= fails if a key is undefined, so we have to
        # test for definition and use either .= or = depending
        if(defined $recordtypes{$recordtype} ) {
            $recordtypes{$recordtype} .= $line;
        }else{
            $recordtypes{$recordtype} = $line;
        }
    }

    return %recordtypes;
}


# extractSEQRES
#
#--given a scalar containing SEQRES lines,
#    return an array containing the chains of the sequence

sub extractSEQRES {

    use strict;
    use warnings;

    my($seqres) = @_;

    my $lastchain = '';
    my $sequence = '';
    my @results = (  );

    # make array of lines
    my @record = split ( /\n/, $seqres);

    foreach my $line (@record) {
        # Chain is in column 12, residues start in column 20
        my ($thischain) = substr($line, 11, 1);
        my($residues)  = substr($line, 19, 52); # add space at end

        # Check if a new chain, or continuation of previous chain
        if("$lastchain" eq "") {
            $sequence = $residues;
        }elsif("$thischain" eq "$lastchain") {
            $sequence .= $residues;

        # Finish gathering previous chain (unless first record)
        }elsif ( $sequence ) {
            push(@results, $sequence);
            $sequence = $residues;
        }
        $lastchain = $thischain;
    }

    # save last chain
    push(@results, $sequence);

    return @results;
}
# iub3to1
#
#--change string of 3-character IUB amino acid codes (whitespace separated)
#    into a string of 1-character amino acid codes

sub iub3to1 {

    my($input) = @_;

    my %three2one = (
      'ALA' => 'A',
      'VAL' => 'V',
      'LEU' => 'L',
      'ILE' => 'I',
      'PRO' => 'P',
      'TRP' => 'W',
      'PHE' => 'F',
      'MET' => 'M',
      'GLY' => 'G',
      'SER' => 'S',
      'THR' => 'T',
      'TYR' => 'Y',
      'CYS' => 'C',
      'ASN' => 'N',
      'GLN' => 'Q',
      'LYS' => 'K',
      'ARG' => 'R',
      'HIS' => 'H',
      'ASP' => 'D',
      'GLU' => 'E',
    );

    # clean up the input
    $input =~ s/\n/ /g;

    my $seq = '';

    # This use of split separates on any contiguous whitespace
    my @code3 = split(' ', $input);

    foreach my $code (@code3) {
        # A little error checking
        if(not defined $three2one{$code}) {
            print "Code $code not defined\n";
            next;
        }
        $seq .= $three2one{$code};
    }
    return $seq;
}

It's important to note that the main program, which calls the subroutine 
get_file_data to read in the PDB file, has included a warning about the potentially 
large size of any given PDB file. (For instance, the PDB file 1gav weighs in at 3.45 MB.) 
Plus, the main program follows the reading in of the entire file with a call to the subroutine 
parsePDBrecordtypes, which makes copies of all lines in the input file, separated by 
record type. At this point, the running program is using twice the amount of memory as 
the size of the file. This design has the advantage of clarity and modularity, but it can 
cause problems if main memory is in short supply. The use of memory can be lessened 
by not saving the results of reading in the file, but instead passing the file data directly to 
the parsePDBrecordtypes subroutine, like so: 

# Get the file data and parse the record types of the PDB file
%recordtypes = parsePDBrecordtypes(get_file_data('pdb/c1/pdb1c1f.ent'));








Further savings of memory are possible. For instance, you can rewrite the program to just 
read the file one line at a time while parsing the data into the record types. I point out 
these considerations to give you an idea of the kinds of choices that are practically 
important in processing large files. However, let's stick with this design for now. It may 
be expensive in terms of memory, but it's very clear in terms of overall program structure. 
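
For comparison, the line-at-a-time alternative mentioned above might look like the following sketch. The subroutine name parse_record_types_from_handle is hypothetical, and the demonstration reads made-up lines from an in-memory filehandle rather than a real PDB entry. It relies on the fact that Perl's .= operator appends to a not-yet-defined hash value without complaint, so no explicit test for definition is made here.

```perl
use strict;
use warnings;

# Read PDB lines from an open filehandle one at a time, appending each
# line to the hash entry for its record type. Only one input line is
# held in memory besides the accumulated hash values.
# parse_record_types_from_handle is a hypothetical name.
sub parse_record_types_from_handle {
    my ($fh) = @_;
    my %recordtypes;

    while (my $line = <$fh>) {
        # Record type: the first run of non-space characters on the line
        my ($recordtype) = ($line =~ /^(\S+)/);
        next unless defined $recordtype;    # skip blank lines
        $recordtypes{$recordtype} .= $line;
    }
    return %recordtypes;
}

# Demonstration on an in-memory "file" (made-up lines, not a real entry)
my $data = "HEADER    SAMPLE\n"
         . "SEQRES   1 A    2  GLY ILE\n"
         . "SEQRES   1 B    1  PHE\n"
         . "END\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
my %recordtypes = parse_record_types_from_handle($fh);
close $fh;

print $recordtypes{'SEQRES'};
```

With a real file you would open the filename instead of the scalar reference; the subroutine itself would not change.
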


In Chapter 10, I demonstrated two ways to parse GenBank files into sequence and 
annotation and then how to parse the annotation into finer and finer levels of detail. 


The first method involved iterating through an array of the lines of the record. Recall that 
due to the structure of multiline fields, it was necessary to set flags while iterating to keep 
track of which field the input line was in.[1] 


[1] In GenBank, the multiline information sets were called fields; in PDB, they're called record 
types. Just as in biology different researchers may use their own terminology for some 
structures or concepts, so too in computer science there can be a certain creativity in 
terminology. This is one of the interesting difficulties in integrating biological data sources. 


The other method, which worked better for GenBank files, involved regular expressions. 
Which method will work best for PDB files? (Or should you settle on a third approach?) 


There are several ways to extract this information. PDB makes it easy to collect record 
types, since they all start with the same keyword at the beginning of the line. The 
technique in the last chapter that used regular expressions parsed the top-level fields of 
the file; this would be somewhat unwieldy for PDB files. (See the exercises at the end of 
the chapter.) For instance, a regular expression such as the following matches all adjacent 
SEQRES lines into a scalar string: 


$record =~ /SEQRES.*\n(SEQRES.*\n)*/;
$seqres = $&;

The regular expression matches a single SEQRES line with SEQRES.*\n and then 
matches zero or more additional lines with (SEQRES.*\n)*. Notice how the final * 
indicates zero or more of the preceding item, namely, the parenthesized expression 
(SEQRES.*\n). Also note that the .* matches zero or more nonnewline characters. 
Finally, the second line captures the pattern matched, denoted by $&, into the variable 
$seqres. 
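
As a small variation on the same idea, the sketch below captures the run of SEQRES lines with parentheses and $1 instead of $&, and anchors the match at the start of a line with the /m modifier. The input string is made up for illustration. (In older Perl releases, mentioning $& anywhere in a program could slow down every regular-expression match, which is one reason some programmers prefer explicit captures.)

```perl
use strict;
use warnings;

# A made-up fragment of a PDB record held in one scalar
my $record = "DBREF  XXXX A    1    21\n"
           . "SEQRES   1 A    3  GLY ILE VAL\n"
           . "SEQRES   1 B    2  PHE VAL\n"
           . "HELIX    1\n";

my $seqres = '';

# With /m, ^ matches at the start of any line. The outer parentheses
# capture the whole run of adjacent SEQRES lines into $1; the inner
# (?: ... ) groups one line without capturing it.
if ($record =~ /^((?:SEQRES.*\n)+)/m) {
    $seqres = $1;
}

print $seqres;
```
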


To extend this to capture all record types, see the exercises at the end of the chapter. 


For PDB files, each line starts with a keyword that explicitly states to which record type 
that line belongs. You will find in the documentation that each record type has all its lines 
adjacent to each other in a group. In this case, it seems that simply iterating through the 
lines and collecting the record types would be the simplest programming approach. 


Example 11-5 contains a subroutine parsePDBrecordtypes that parses the PDB 
record types from an array containing the lines of the PDB record. This is a short, clean 
subroutine that accomplishes what is needed. The comments describe what's happening 
pretty well, which, as you know, is a critical factor in writing good code. Basically, each 
line is examined for its record type and is then added to the value of a hash entry with the 
record type as the key. The hash is returned from the subroutine. 


11.4.1 Extracting Primary Sequence 


Let's examine the subroutine extractSEQRES, now that the record types have been 
parsed out, and extract the primary amino acid sequence. 


You need to extract each chain separately and return an array of one or more strings of 
sequence corresponding to those chains, instead of just one sequence. 


The previous parse, in Example 11-5, left the required SEQRES record type, which 
stretches over several lines, in a scalar string that is the value of the key 'SEQRES' in a 
hash. Our success with the previous parsePDBrecordtypes subroutine that used iteration 
over lines (as opposed to regular expressions over multiline strings) leads to the same 
approach here. The split Perl function enables you to turn a multiline string into an array. 


As you iterate through the lines of the SEQRES record type, notice when a new chain is 
starting, save the previous chain in @results, reset the $sequence string, and reset 
the $lastchain flag to the new chain. Also, when done with all the lines, make sure to 
save the last sequence chain in the @results array. 


Also notice (and verify by exploring the Perl documentation for the function) that split, 
with the arguments you gave it, does what you want. 
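
A quick way to verify this is to try split on a small string. The following sketch uses made-up SEQRES lines and shows the two forms of split that appear in Example 11-5: splitting on a newline pattern, and splitting on the special single-space string.

```perl
use strict;
use warnings;

my $seqres = "SEQRES   1 A    3  GLY ILE VAL\nSEQRES   1 B    2  PHE VAL\n";

# split /\n/ yields one element per line; trailing empty fields are
# dropped, so the final newline does not create an extra element
my @lines = split(/\n/, $seqres);
print scalar(@lines), " lines\n";

# split ' ' (a string holding a single space) is special-cased: it skips
# leading whitespace and splits on any run of whitespace
my @codes = split(' ', "  GLY ILE   VAL \n");
print join(',', @codes), "\n";
```
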


The third and final subroutine of Example 11-5 is called iub3to1. Since in PDB the 
sequence information is in three-character codes, you need this subroutine to change 
those sequences into one-character codes. It uses a straightforward hash lookup to 
perform the translation. 


We've now decomposed the problem into a few complementary 
subroutines. How best to divide a problem into cooperating subroutines is always 
an interesting question. You can put the call to iub3to1 inside the 
extractSEQRES subroutine; that might be a cleaner way to package these subroutines 
together, since, outside the PDB file format, you won't have use for the strings of amino 
acids in three-character codes. 


The important observation at this juncture is that a few short subroutines, tied 
together with a very short main program, were sufficient to do a great deal of parsing of 
PDB files. 


11.4.2 Finding Atomic Coordinates 


So far, I've tried not to give more than a very brief overview of protein structure. 
However, in parsing PDB files, you will be faced with a great deal of detailed 
information about the structures and the experimental conditions under which they were 
determined. I will now present a short program that extracts the coordinates of atoms in a 
PDB file. I don't cover the whole story: for that, you will want to read the PDB 
documentation in detail and consult texts on protein structure, X-ray crystallography, and 
NMR techniques. 


That said, let's extract the coordinates from the ATOM record type. ATOM record types 
are the most numerous of the several record types that deal with atomic-coordinate data: 
MODEL, ATOM, SIGATM, ANISOU, SIGUIJ, TER, HETATM, and ENDMDL. There 
are also several record types that handle coordinate transformation: ORIGXn, SCALEn, 
MTRIXn, and TVECT. 


Here is part of the PDB documentation that shows the field definitions of each ATOM 
record: 


ATOM

Overview

The ATOM records present the atomic coordinates for standard residues.
They also present the occupancy and temperature factor for each atom.
Heterogen coordinates use the HETATM record type. The element symbol
is always present on each ATOM record; segment identifier and charge
are optional.

Record Format

COLUMNS   DATA TYPE      FIELD        DEFINITION
 1 -  6   Record name    "ATOM  "
 7 - 11   Integer        serial       Atom serial number.
13 - 16   Atom           name         Atom name.
17        Character      altLoc       Alternate location indicator.
18 - 20   Residue name   resName      Residue name.
22        Character      chainID      Chain identifier.
23 - 26   Integer        resSeq       Residue sequence number.
27        AChar          iCode        Code for insertion of residues.
31 - 38   Real(8.3)      x            Orthogonal coordinates for X in
                                      Angstroms.
39 - 46   Real(8.3)      y            Orthogonal coordinates for Y in
                                      Angstroms.
47 - 54   Real(8.3)      z            Orthogonal coordinates for Z in
                                      Angstroms.
55 - 60   Real(6.2)      occupancy    Occupancy.
61 - 66   Real(6.2)      tempFactor   Temperature factor.
73 - 76   LString(4)     segID        Segment identifier, left-justified.
77 - 78   LString(2)     element      Element symbol, right-justified.
79 - 80   LString(2)     charge       Charge on the atom.


Here is a typical ATOM line: 


ATOM      1  N   GLY A   2       1.888  -6.251  -2.511  1.00 36.63           N





Let's do something fairly simple: let's extract all x, y, and z coordinates for each atom, 
plus the serial number (a unique integer for each atom in the molecule) and the element 
symbol. Example 11-6 is a subroutine that accomplishes that, with a main program to 
exercise the subroutine. 


Example 11-6. Extract atomic coordinates from PDB file 


#!/usr/bin/perl
# Extract atomic coordinates from PDB file

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Read in PDB file
my @file = get_file_data('pdb/c1/pdb1c1f.ent');

# Parse the record types of the PDB file
my %recordtypes = parsePDBrecordtypes(@file);

# Extract the atoms of all chains in the protein
my %atoms = parseATOM ( $recordtypes{'ATOM'} );

# Print out a couple of the atoms
print $atoms{'1'}, "\n";
print $atoms{'1078'}, "\n";

exit;


################################################################################
# Subroutines of Example 11-6
################################################################################

# parseATOM
#
# --extract x, y, and z coordinates, serial number and element symbol
#     from PDB ATOM record type
#     Return a hash with key=serial number, value=coordinates in a string

sub parseATOM {

    my($atomrecord) = @_;

    use strict;
    use warnings;

    my %results = (  );

    # Turn the scalar into an array of ATOM lines
    my(@atomrecord) = split(/\n/, $atomrecord);

    foreach my $record (@atomrecord) {
        my $number  = substr($record,  6, 5);  # columns 7-11
        my $x       = substr($record, 30, 8);  # columns 31-38
        my $y       = substr($record, 38, 8);  # columns 39-46
        my $z       = substr($record, 46, 8);  # columns 47-54
        my $element = substr($record, 76, 2);  # columns 77-78

        # $number and $element may have leading spaces: strip them
        $number  =~ s/^\s*//;
        $element =~ s/^\s*//;

        # Store information in hash
        $results{$number} = "$x $y $z $element";
    }

    # Return the hash
    return %results;
}


The parseATOM subroutine is quite short: the strict format of these ATOM records 
makes parsing the information quite straightforward. You first split the scalar argument, 
which contains the ATOM lines, into an array of lines. 


Then, for each line, use the substr function to extract the specific columns of the line 
that contains the needed data: the serial number of the atom; the x, y, and z coordinates; 
and the element symbol. 


Finally, save the results by making a hash with keys equal to the serial numbers and 
values set to strings containing the other four relevant fields. Now, this may not always 
be the most convenient way to return the data. For one thing, hashes are not sorted on the 
keys, so that would need to be an additional step if you had to sort the atoms by serial 
number. In particular, an array is a logical choice to store information sorted by serial 
number. Or, it could be that what you really want is to find all the metals, in which case, 
another data structure would be suggested. Nevertheless, this short subroutine shows one 
way to find and report information. 
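
The two follow-up steps just mentioned, sorting by serial number and picking out atoms of one element, can be sketched on top of the returned hash as follows. The hash contents here are made-up values in the same shape that parseATOM produces (three coordinates followed by the element symbol).

```perl
use strict;
use warnings;

# A small hash in the shape parseATOM returns (values are made up)
my %atoms = (
    '10' => '   1.000    2.000    3.000 C',
    '2'  => '   4.000    5.000    6.000 N',
    '1'  => '   7.000    8.000    9.000 N',
);

# Numeric sort of the keys: <=> compares as numbers, so 2 precedes 10
my @serials = sort { $a <=> $b } keys %atoms;
print "@serials\n";

# Collect the serial numbers of all atoms of one element; the element
# symbol is the fourth whitespace-separated field of each value
my @nitrogens = grep {
    my @fields = split ' ', $atoms{$_};
    $fields[3] eq 'N';
} @serials;
print "@nitrogens\n";
```

A plain sort without the comparison block would order the keys as strings, placing '10' before '2'.
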


It often happens that what you really need is a reformatting of the data for use by another 
program. Using the technique of this subroutine, you can see how to extract the needed 
data and add a print statement that formats the data into the desired form. Take a look 
at the printf and sprintf functions to get more specific control over the format. For real 
heavy-duty formatting, there's the format function, which merits its own chapter in 
O'Reilly's comprehensive Programming Perl. (See also Chapter 12 and Appendix B 
of this book.) 
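
As a brief illustration of sprintf, the following sketch formats one atom into fixed-width columns. The subroutine name format_atom and the particular column widths are assumptions chosen for the example, not a format required by any specific program.

```perl
use strict;
use warnings;

# Format one atom as fixed-width columns: a 5-wide integer, three
# 8-wide reals with 3 decimals, a space, and a 2-wide right-justified
# string. format_atom is a hypothetical helper name.
sub format_atom {
    my ($serial, $x, $y, $z, $element) = @_;
    return sprintf("%5d%8.3f%8.3f%8.3f %2s", $serial, $x, $y, $z, $element);
}

my $row = format_atom(7, 1.5, -10.25, 0, 'N');
print "$row\n";
```

The same format string works with printf when you want to print directly instead of building a string first.
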


Here's the output from Example 11-6: 


   1.888   -6.251   -2.511 N
  18.995  -10.180   10.777 C





You can now extract at least the major portion of the atomic coordinates from a PDB file. 
Again, notice the good news: it doesn't take a long or particularly complex program to do 
what is needed. 


This program has been designed so that its parts can be used in the future to work well for 
other purposes. You parse all record types, for instance, not only the ATOM record types. 
Let's take a look at a very short program that just parses the ATOM record type lines 
from an input file; by targeting only this one problem, you can write a much shorter 
program. Here's the program: 


while(<>) {
    /^ATOM/ or next;

    my($n, $x, $y, $z, $element)
       = ($_ =~ /^.{6}(.{5}).{19}(.{8})(.{8})(.{8}).{22}(..)/);

    # $n and $element may have leading spaces: strip them
    $n       =~ s/^\s*//;
    $element =~ s/^\s*//;

    if (($n == 1) or ($n == 1078)) {
        printf "%8.3f%8.3f%8.3f %2s\n", $x, $y, $z, $element;
    }
}


For each line, a regular-expression match extracts just the needed information. Recall that 
a regular expression that contains parentheses metacharacters returns, in list context, a list whose 
elements are the parts of the string that matched within the parentheses. You assign the 
five variables $n, $x, $y, $z, and $element to the substrings from these 
matched parentheses. 


The actual regular expression simply uses dots and the quantifier .{num} to 
stand for num characters. In this way, you can, starting from the beginning of the string as 
represented by the caret ^ metacharacter, specify the columns with the information you 
want returned by surrounding them with parentheses. 


For instance, you don't want the first six characters, so you specify them as ^.{6}, but 
you do want the next five characters because they contain the serial number of the atom; 
so, specify that field as (.{5}). 


Frankly, I think that the use of substr is clearer for this purpose, but I wanted to show 
you an alternative way using regular expressions as well. 
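
A third option for fixed-column data, not used in this chapter's examples, is Perl's built-in unpack function, whose template describes the columns directly. The sketch below is offered only for comparison; the ATOM line is built with sprintf from made-up values so that the column positions are exact.

```perl
use strict;
use warnings;

# Build a made-up ATOM line laid out in the standard columns:
# record name 1-6, serial 7-11, atom name 13-16, altLoc 17,
# resName 18-20, chainID 22, resSeq 23-26, iCode 27, x/y/z 31-54,
# occupancy 55-60, tempFactor 61-66, segID 73-76, element 77-78
my $line = sprintf(
    "%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f      %-4s%2s",
    'ATOM', 1, ' N', '', 'GLY', 'A', 2, '',
    1.5, -2.25, 3.125, 1, 20, '', 'N');

# Template: x6 skips 6 characters, A5 reads 5, and so on. The A format
# strips trailing spaces only, so leading spaces are removed afterward.
my ($serial, $x, $y, $z, $element)
    = unpack('x6 A5 x19 A8 A8 A8 x22 A2', $line);
s/^\s+// for ($serial, $x, $y, $z, $element);

print "$serial $x $y $z $element\n";
```

One caveat: unpack dies if the template skips past the end of a line that is shorter than expected, so truncated records need a length check first.
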


We've already seen the use of the printf function to format output with more options 
than with the print function. 


This program has another important shortcut. It doesn't specify the file to open and read 
from. In Perl, you can give the input filename on the command line (or drag and drop it 
onto a Mac droplet), and the program takes its input from that file. Just use the angle 
brackets as shown in the first line of the program to read from the file. For short, fast 
programs, such as the one demonstrated here, this is a great convenience. You can leave 
out all the calls and tests for success of the open function and just use the angle brackets. 
You would call it from the command line like so, assuming you saved the program in a 
file called get_two_atoms: 


% perl get_two_atoms pdb1a4o.ent 


Alternatively, you can pipe data to the program with the commands: 


% cat pdb1a4o.ent | perl get_two_atoms 


or: 


% perl get_two_atoms < pdb1a4o.ent 

and use <STDIN> instead of <> in your program to read the data. (The <> operator also 
reads from standard input when no filename is given on the command line.) 


11.5 Controlling Other Programs 


Perl makes it easy to start other programs and collect their output, all from within your 
Perl program. This is an extremely useful capability; for most programs, Perl makes it 
fairly simple to accomplish. 


You may need to run some particular program many times, for instance over every file in 
PDB to extract secondary structure information. The program itself may not have a way 
to tell it "run yourself over all these files." Also, the output of the program may have all 
sorts of extraneous information. What you need is a much simpler report that just 
presents the information that interests you—perhaps in a format that could then be input 
to another program! With Perl you can write a program to do exactly this. 


An important kind of program to automate is a web site that provides some useful 
program or data online. Using the appropriate Perl modules, you can connect to the web 
site, send it your input, collect the output, and then parse and reformat as you wish. It's 
actually not hard to do! O'Reilly's Perl Cookbook, a companion volume to 
Programming Perl, is an excellent source of short programs and helpful descriptions 
to get you started. 


Perl is a great way to automate other programs. The next section shows an example of a 
Perl program that starts another program and collects, parses, reformats, and outputs the 
results. This program will control another program on the same computer. The example 
will be from a Unix or Linux environment; consult your Perl documentation on how to 
get the same functionality from your Windows or Macintosh platform. 


11.5.1 The Stride Secondary Structure Predictor 


We will use an external program to calculate the secondary structure from the 3D 
coordinates of a PDB file. As a secondary structure assignment engine, I use a program 
that outputs a secondary structure report, called stride. stride is available from EMBL 
(http://www.embl-heidelberg.de/stride/stride_info.html) and runs on Unix, 
Linux, Windows, Macintosh, and VMS systems. The program works very simply; just 
give it a command-line argument of a PDB filename and collect the output in the 
subroutine call_stride that follows. 

Example 11-7 is the entire program: two subroutines and a main program, followed 
by a discussion. 


Example 11-7. Call another program for secondary structure prediction 

#!/usr/bin/perl
# Call another program to perform secondary structure prediction

use strict;
use warnings;

# Call "stride" on a file, collect the report
my(@stride_output) = call_stride('pdb/c1/pdb1c1f.ent');

# Parse the stride report into primary sequence, and secondary
# structure prediction
my($sequence, $structure) = parse_stride(@stride_output);

# Print out the beginnings of the sequence and the secondary structure
print substr($sequence, 0, 80), "\n";
print substr($structure, 0, 80), "\n";

exit;

################################################################################
# Subroutines for Example 11-7
################################################################################





# call_stride
#
#--given a PDB filename, return the output from the "stride"
#    secondary structure prediction program

sub call_stride {

    use strict;
    use warnings;

    my($filename) = @_;

    # The stride program options
    my($stride) = '/usr/local/bin/stride';
    my($options) = '';
    my(@results) = (  );

    # Check for presence of PDB file
    unless ( -e $filename ) {
        print "File \"$filename\" doesn\'t seem to exist!\n";
        exit;
    }

    # Start up the program, capture and return the output
    @results = `$stride $options $filename`;

    return @results;
}


# parse_stride
#
#--given stride output, extract the primary sequence and the
#    secondary structure prediction, returning them in a
#    two-element array.

sub parse_stride {

    use strict;
    use warnings;

    my(@stridereport) = @_;
    my($seq) = '';
    my($str) = '';
    my $length;

    # Extract the lines of interest
    my(@seq) = grep(/^SEQ /, @stridereport);
    my(@str) = grep(/^STR /, @stridereport);

    # Process those lines to discard all but the sequence
    # or structure information
    for (@seq) { $_ = substr($_, 10, 50) }
    for (@str) { $_ = substr($_, 10, 50) }

    # Return the information as an array of two strings
    $seq = join('', @seq);
    $str = join('', @str);

    # Delete unwanted spaces from the ends of the strings.
    # ($seq has no spaces that are wanted, but $str may)
    # If $seq has no trailing spaces, nothing is removed from $str.
    $length = ($seq =~ s/(\s+)$//) ? length($1) : 0;
    $str =~ s/\s{$length}$//;

    return( ($seq, $str) );
}


As you can see in the subroutine call_stride, variables have been made for the program 
name ($stride) and for the options you may want to pass ($options). Since these 
are parts of the program you may want to change, put them as variables near the top of 
the code, to make them easy to find and alter. The argument to the subroutine is the PDB 
filename ($filename). (Of course, if you expect the options to change frequently, you 
can make them another argument to the subroutine.) 


Since you're dealing with a program that takes a file, do a little error checking to see if a 
file by that name actually exists. Use the -e file test operator. Or you can omit this and 
let the stride program figure it out, and capture its error output. But that requires parsing 
the stride output for its error output, which involves figuring out how stride reports errors. 
This can get complicated, so I'd stick with using the -e file test operator. 


The actual running of the program and collecting its output happens in just one line. The 
program to be run is enclosed in backticks, which run the program (first expanding 
variables) and, in list context, return the output as an array of lines. 


There are other ways to run a program. One common way is the system function call. It 
behaves differently from the backticks: it doesn't return the output of the command it 
calls (it just returns the exit status, an integer indicating success or failure of the 
command). Other methods include qx (an alternative spelling of the backticks), the open 
function, and the fork and exec functions. 
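
To see the difference between capturing output and merely getting an exit status, here is a small sketch. So that it does not depend on stride being installed, it runs the Perl interpreter itself (whose path Perl keeps in the special variable $^X) as the external program. The list form of the piped open used here is assumed to be available, as it is on Unix and Linux systems.

```perl
use strict;
use warnings;

# $^X is the path of the perl binary running this script; it stands in
# for an external program such as stride.

# 1. Piped open: read the child's output, one element per line
open(my $pipe, '-|', $^X, '-e', 'print "one\ntwo\n"')
    or die "Cannot start child: $!";
my @lines = <$pipe>;
close $pipe;
print "captured ", scalar(@lines), " lines\n";

# 2. system: the output is not captured; the return value holds the
#    child's exit code in its high byte
my $status = system($^X, '-e', 'exit 3');
print "exit code ", $status >> 8, "\n";
```

Passing the program and its arguments as a list, as done here, bypasses the shell, which avoids quoting problems when filenames contain spaces or other special characters.
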


11.5.2 Parsing Stride Output 


I don't go into too much detail here about parsing the output of stride. Let's just exhibit 
some code that extracts the primary sequence and the secondary structure prediction. See 
the exercises at the end of the chapter for a challenge to extract the secondary structure 
information from a PDB file's HELIX, SHEET, and TURN record types and output the 
information in a similar format as stride does here. 


Here is a typical section of a stride output (not the entire output): 


SEQ  1    MDKNELVQKAKLAEQAERYDDMAACMKSVTEQGAELSNEERNLLSVAYKN   50  1A4O
STR       HHHHHHHHHHHHHH HHHHHHHHHHHHHTTT HHHHHHHHHHHHH  1A4O
REM  1A4O
REM  1A4O
SEQ  51   VVGARRSSWRVVSSIEQKEKKQQMAREYREKIETELRDICNDVLSLLEKF  100  1A4O
STR       HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH  1A4O
REM  1A4O
REM  1A4O
SEQ  101  LIPNAAESKVFYLKMKGDYYRYLAEVAAGDDKKGIVDQSQQAYQEAFEIS  150  1A4O
STR       TTTTT HHHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHH  1A4O
REM  1A4O
REM  1A4O
SEQ  151  KKEMIRLGLALNFSVFYYACSLAKTAFDEAIAELLIMQLLRDNLTLW  197  1A4O
STR       TTTTHHHHHHHHHHHHH HHHHHHHHHHHHH HHHHHHHHHH  1A4O








Notice that each line is prefaced by an identifier, which should make collecting the 
different record types easy. Without even consulting the documentation (a slightly 
dangerous but sometimes expedient approach), you can see that the primary sequence has 
keyword SEQ, the structure prediction has keyword STR, and the data of interest lies 
from position 11 up to position 60 on each line. (We'll ignore everything else for now.) 


The following list shows the one-letter secondary structure codes used by stride: 


H        Alpha helix
G        3-10 helix
I        PI helix
E        Extended conformation
B or b   Isolated bridge
T        Turn
C        Coil (none of the above)


Using the substr function, the two for loops alter each line of the two arrays by 
saving the 11th to the 60th positions of those strings. This is where the desired 
information lies. 


Now let's examine the subroutine parse_stride in Example 11-7 that takes stride 
output and returns an array of two strings, the primary sequence and the structure 
prediction. 


This is a very "Perlish" subroutine that uses some features that manipulate text. What's 
interesting is the brevity of the program, which some of Perl's built-in functions make 
possible. 


First, you receive the output of the stride program in the subroutine argument @_. Next, 
use the grep function to extract those lines of interest, which are easy to identify in this 
output, as they begin with clear identifiers SEQ and STR. 


Next, you want to save just those positions (or columns) of these lines that have the 
sequence or structure information; you don't need the keywords, position numbers, or the 
PDB entry name at the end of the lines. 


Finally, join the arrays into single strings. Here, there's one detail to handle; you need to 
remove any unneeded spaces from the ends of the strings. Notice that stride sometimes 
leaves spaces in the structure prediction, and in this example, has left some at the end of 
the structure prediction. So you shouldn't throw away all the spaces at the ends of the 
strings. Instead, throw away all the spaces at the end of the sequence string, because they 
are just superfluous spaces on the line. Now, see how many spaces that was, and throw 
the equal amount away at the end of the structure prediction string, thus preserving 
spaces that correspond to undetermined secondary structure. 
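
The steps just described can be sketched as follows. This is a minimal sketch written from the prose above, not necessarily the listing printed as Example 11-7; the name parse_stride_sketch and the demonstration records are my own assumptions, and the records are made up to sit in the right columns rather than copied from real stride output.

```perl
use strict;
use warnings;

# Hypothetical name; a sketch of the steps described in the text.
sub parse_stride_sketch {
    my (@stride) = @_;
    chomp @stride;

    # Keep only the primary sequence (SEQ) and structure (STR) records
    my @seq = grep { /^SEQ / } @stride;
    my @str = grep { /^STR / } @stride;

    # Positions 11 through 60 are offset 10, length 50, for substr
    for (@seq) { $_ = substr($_, 10, 50) }
    for (@str) { $_ = substr($_, 10, 50) }

    my $seq = join('', @seq);
    my $str = join('', @str);

    # Drop trailing spaces from the sequence, then drop at most that many
    # from the structure, so spaces that mean "undetermined" survive
    if ($seq =~ s/(\s+)$//) {
        my $count = length $1;
        $str =~ s/\s{0,$count}$//;
    }

    return ($seq, $str);
}

# Made-up records laid out in the same columns (not real stride output)
my @demo = (
    sprintf("%-10s%-50s %4d  %s", 'SEQ    1', 'ACDEFGHIKL', 10, 'XXXX'),
    sprintf("%-10s%-50s       %s", 'STR', 'HHHH  EEE ', 'XXXX'),
);
my ($demo_seq, $demo_str) = parse_stride_sketch(@demo);
print "[$demo_seq]\n[$demo_str]\n";
```

The brackets in the print statement make the preserved trailing space in the structure string visible.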


Example 11-7 contains a main program that calls two subroutines, which, since they
are short, are all included (so there's no need here for the BeginPerlBioinfo module).
Here's the output of Example 11-7:
GGLOVKNFDFTVGKFLTVGGFINNSPORFSVNVGESMNSLSLHLDHREFNYGADONTIVM 
NSTLKGDNGWETEQRSTNFTL 

TTITTTTBTTT HEEEEEETTTT 
BEBRRERBERTTERRBRRERERRRETTERREREERREERETTGGG B EEE 














The first line shows the amino acids, and the second line shows the prediction of the 
secondary structure. Check the next section for a subroutine that will improve that output. 


11.6 Exercises 


Exercise 11.1 


Use File::Find and the file test operators to find the oldest and largest files on 
the hard drive of your computer. (You can delete them or store them elsewhere if 
you're running short on disk space.) 


Exercise 11.2 


Find all the Perl programs on your computer. 
Hint: Use File::Find. What do all Perl programs have in common? 


Exercise 11.3 


Parse the HEADER, TITLE, and KEYWORDS record types of all PDB files on 
your computer. Make a hash with key as a word from those record types and 
value as a list of filenames that contained that word. Save it as a DBM file and 
build a query program for it. In the end, you should be able to ask for, say, sugar, 
and get a list of all PDB files that contain that word in the HEADER, TITLE, or 
KEYWORDS records. 


Exercise 11.4 


Parse out the record types of a PDB file using regular expressions (as used in 
Chapter 10) instead of iterating through an array of input lines (as in this 
chapter). 


Exercise 11.5 


Write a program that extracts the secondary structure information contained in the 
HELIX, SHEET, and TURN record types of PDB files. Print out the secondary 
structure and the primary sequence together, so that it's easy to see by what 
secondary structure a given residue is included. (Consider using a special alphabet 
for secondary structure, so that every residue in a helix is represented by H, for 
example.) 


Exercise 11.6 


Write a program that finds all PDB files under a given folder and runs a program 
(such as Stride, or the program you wrote in Exercise 11.5) that reports on the 
secondary structure of each PDB file. Store the results in a DBM file keyed on the 
filename. 


Exercise 11.7 


Write a subroutine that, given two strings, prints them out one over the other, but 
with line breaks (similar to the stride program output). Use this subroutine to 
print out the strings from Example 11-7. 


Exercise 11.8 


Write a recursive subroutine to determine the size of an array. You may want to 
use the pop or unshift functions. (Ignore the fact that scalar @array 
returns the size of @array!) 


Exercise 11.9 


Write a recursive subroutine that extracts the primary amino acid sequence from 
the SEQRES record type of a PDB file. 


Exercise 11.10 


(Extra credit) Given an atom and a distance, find all other atoms in a PDB file 
that are within that distance of the atom. 


Exercise 11.11 


(Extra credit) Write a program to find some correlation between the primary 
amino acid sequence and the location of alpha helices. 


Chapter 12. BLAST 


In biological research, the search for sequence similarity is very important. For instance, 
a researcher who has discovered a potentially important DNA or protein sequence wants 
to know if it's already been identified and characterized by another researcher. If it hasn't, 
the researcher wants to know if it resembles any known sequence from any organism. 
This information can provide vital clues as to the role of the sequence in the organism. 


The Basic Local Alignment Search Tool (BLAST) is one of the most popular software 
tools in biological research. It tests a query sequence against a library of known 
sequences in order to find similarity. BLAST is actually a collection of programs with 
versions for query-to-database pairs such as nucleotide-nucleotide, protein-nucleotide, 
protein-protein, nucleotide-protein, and more. 


This chapter examines the output from the nucleotide-nucleotide version of the program, 
BLASTN. For simplicity's sake, I'll refer to it here as BLAST. The main goal of 
this chapter is to show how to write code to parse a BLAST output file using regular 
expressions. The code is simple and basic, but it does the job. Once you understand the 
basics, you can build more features into your parser or obtain one of the fancier BLAST 
output parsers that's available via the Web. In either case, you'll know enough about 
output parsers to use or extend them. 


This chapter also gives you a brief introduction to Bioperl, which is a collection of Perl 
bioinformatics modules. The Bioperl project is an example of an open source project that 
you, the Perl bioinformatics programmer, can put to good use. The Perl programming 
language is itself an open source project. The program and its source code are available 
for use and modification with only very reasonable restrictions and at no cost. 


12.1 Obtaining BLAST 


There are several implementations of BLAST. The most popular is probably the one 
offered free of charge by the National Center for Biotechnology Information (NCBI): 
http://www.ncbi.nlm.nih.gov/BLAST/. The NCBI web site features a publicly 
available BLAST server, a comprehensive set of databases, and a well-organized 
collection of documents and tutorials, in addition to the BLAST software available for 
downloading. 


Also popular is the WU-BLAST implementation from Washington University. The main 
web site, including a list of other WU-BLAST servers, can be found at 
http://blast.wustl.edu. Older versions of WU-BLAST are available at no charge. 
Newer versions are free if you qualify as a research or nonprofit organization and agree 
to the licensing arrangements from Washington University, where the program is 
developed and maintained. If you work at a major research organization, you may already 
have a site license for the WU-BLAST program. If you are a for-profit company, there is 
a rather hefty charge for the newer WU-BLAST program (older versions are freely 
available if you want to run BLAST on your own computer). Pennsylvania State 
University also develops some BLAST programs, available at 
http://bio.cse.psu.edu/. In addition to NCBI and WU-BLAST, many other BLAST 
server web sites are available. A Google search (http://www.google.com) on 
"BLAST server" will bring up many hits. 


A big question that faces researchers when they use BLAST is whether to use a public 
BLAST server or to run it locally. There are significant advantages to using a public 
server, the largest being that the databases (such as GenBank) used by the BLAST server 
are always up to date. To keep your own up-to-date copy of these databases requires a 
significant amount of hard-disk space, a computer with a fairly high-end processor and a 
lot of memory (to run the BLAST engine), a high-capacity network link, and a lot of time 
setting up and overseeing the software that updates the databases. On the other hand, 
perhaps you have your own library of sequences that you want to use in BLAST searches, 
you do frequent or large searches, or you have other reasons to run your own in-house 
BLAST engine. If that's the case, it makes sense to invest in the hardware and run it 
locally. 


The online documentation for BLAST is fairly extensive and includes details on the 
statistical methods the program uses to calculate similarity. In the next section, I touch 
briefly on some of those points, but you should refer to the BLAST home page and to the 
excellent material at the NCBI web site for the whole story and detailed references. Our 
interest here is not the theory, but rather to parse the output of the program. 


12.2 String Matching and Homology 


String matching is the computer-science term for algorithms that find one string 
embedded in another. It has a fairly long and fruitful history, and many string-matching 
algorithms have been developed using a variety of techniques and for different cases. 
(See the Gusfield book in Appendix A for an excellent treatment with a biological 
emphasis.) We've already done a fair amount of string matching, using the binding 
operator to search for motifs and other text with regular expressions. 


BLAST is basically a string-matching program. Details of the string-matching algorithms, 
and of the algorithms used in BLAST in particular, are beyond the scope of this book. 
I do, however, want to define some terms that are frequently confused or used 
interchangeably, and to briefly introduce the BLAST statistics. 


Biological string matching looks for similarity as an indication of homology. Similarity 
between the query and the sequences in the database may be measured by the percent 
identity, or the number of bases in the query that exactly match a corresponding region of 
a sequence from the database. It may also be measured by the degree of conservation, 
which finds matches between equivalent (redundant) codons or between amino acid 
residues with similar properties that don't alter the function of a protein (see Chapter 8). 
Homology between sequences means the sequences are related evolutionarily. Two 
sequences are or are not homologous; there's no degree of homology. 
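
To make "percent identity" concrete, here is a small sketch that counts exact matches between two already-aligned strings of equal length. The subroutine name percent_identity is my own, and the sketch assumes the alignment has been done elsewhere; any gap character simply counts as a mismatch unless both strings carry the same character at that position.

```perl
use strict;
use warnings;

# Percent identity between two aligned strings of equal length.
# (Hypothetical helper; the alignment itself is assumed to be done.)
sub percent_identity {
    my ($query, $subject) = @_;
    die "aligned strings must have equal, nonzero length\n"
        unless length($query) && length($query) == length($subject);

    my $matches = 0;
    for my $i (0 .. length($query) - 1) {
        $matches++ if substr($query, $i, 1) eq substr($subject, $i, 1);
    }
    return 100 * $matches / length($query);
}

# Two 23-base strings that differ at a single position
printf "%.1f%%\n", percent_identity('ggcgggggtcgtgagggagtgcg',
                                    'ggcgtgggtcgtgagggagtgcg');
```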


At the risk of oversimplifying a complex topic, I'll summarize a few facts about BLAST 
statistics. (See the BLAST documentation for a complete picture.) The output of a 
BLAST search reports a set of scores and statistics on the matches it has found, based on 
the raw score S, various parameters of the scoring algorithm, and properties of the query 
and database. The raw score S is a measure of similarity and the size of the match. The 
BLAST output lists the hits ranked by their E value. The E (expect) value of a match 
is, roughly, the number of matches at least that good (allowing for gaps) that you would 
expect to find by chance in a randomly generated database of the same size and 
composition. The closer to 0 the E value is, the less likely the match occurred by chance. 
In other words, the lower the E value, the better the match. As a general rule of thumb 
for BLASTN, an E value less than 1 may be a solid hit, and an E value of less than 10 may 
be worth looking at, but this is not a hard and fast rule. (Of course, proteins can be 
homologous with even a very small percent identity; the percent similarity is typically 
higher for homologous DNA.) 
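
As a rough illustration of how the E value relates to the other reported numbers, the standard BLAST statistics give E as the effective search space multiplied by 2 raised to the negative bit score. The sketch below applies that relation; the subroutine name is my own, and the two input numbers are the bit score and the effective search space reported in the sample output file shown in Section 12.3. Because the bit score is printed rounded to one decimal place, the result only approximates the E value that BLAST itself prints.

```perl
use strict;
use warnings;

# E = (effective search space) * 2 ** (-bit score)
# Hypothetical helper name; the formula is the usual bit-score relation.
sub expect_from_bits {
    my ($bit_score, $search_space) = @_;
    return $search_space * 2 ** (-$bit_score);
}

# A 38.2-bit hit against an effective search space of 1246849050940
printf "E is roughly %.1f\n", expect_from_bits(38.2, 1246849050940);
```

The result lands close to the 4.1 that the sample output reports for its 38.2-bit hits, which is a useful sanity check when you parse these fields yourself.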


Now that you have the basics, let's write code to parse BLAST output. First, you separate 
the hits, then extract the sequence, and finally, you find the annotation showing the E 
value statistic. 


12.3 BLAST Output Files 


The following is part of a BLAST output file. I created it by entering a few lines of the 
sample.dna file from Chapter 8 into the BLAST program at the NCBI web site, 
without changing any of the default parameters. I then saved the output as text in the file 
blast.txt, which is available from this book's web site. I've used it repeatedly in the 
parsing routines throughout this chapter. Because the output is several pages long, I've 
truncated it here to show the beginning, the middle, and the end of the file. 

BLASTN 2.1.3 [Apr-11-2001]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

RID: 991533563-27495-9092

Query=
         (400 letters)

Database: nt
           868,831 sequences; 3,298,558,333 total letters


                                                                    Score     E
Sequences producing significant alignments:                         (bits)  Value

dbj|AB031069.1|AB031069 Homo sapiens PCCX1 mRNA for protein cont...   793  0.0
ref|NM_014593.1| Homo sapiens CpG binding protein (CGBP), mRNA        779  0.0
gb|AF149758.1|AF149758 Homo sapiens CpG binding protein (CGBP) m...   779  0.0
ref|XM_008699.3| Homo sapiens CpG binding protein (CGBP), mRNA        765  0.0
emb|AL136862.1|HSM801830 Homo sapiens mRNA; cDNA DKFZp434F174 (f...   450  e-124
emb|AJ132339.1|HSA132339 Homo sapiens CpG island sequence, subcl...   446  e-123
emb|AJ236590.1|HSA236590 Homo sapiens chromosome 18 CpG island D...   406  e-111
dbj|AK010337.1|AK010337 Mus musculus ES cells cDNA, RIKEN full-l...   234  3e-59
dbj|AK017941.1|AK017941 Mus musculus adult male thymus cDNA, RIK...   210  5e-52
gb|AC009750.7|AC009750 Drosophila melanogaster, chromosome 2L, r...    46  0.017
gb|AE003580.2|AE003580 Drosophila melanogaster genomic scaffold        46  0.017
ref|NC_001905.1| Leishmania major chromosome 1, complete sequence      40  1.0
gb|AE001274.1|AE001274 Leishmania major chromosome 1, complete s...    40  1.0
gb|AC008299.5|AC008299 Drosophila melanogaster, chromosome ...         38  4.1
gb|AC018662.3|AC018662 Human Chromosome 7 clone RP11-339C9, comp...    38  4.1
gb|AE003774.2|AE003774 Drosophila melanogaster genomic scaffold        38  4.1
gb|AC008039.1|AC008039 Homo sapiens clone SCb-391H5 from ...           38  4.1
gb|AC005315.2|AC005315 Arabidopsis thaliana chromosome II sectio...    38  4.1
emb|AL353748.13|AL353748 Human DNA sequence from clone RP11-317B...    38  4.1


ALIGNMENTS 





>dbj|AB031069.1|AB031069 Homo sapiens PCCX1 mRNA for protein containing CXXC
           domain 1, complete cds
          Length = 2487

 Score =  793 bits (400), Expect = 0.0
 Identities = 400/400 (100%)
 Strand = Plus / Plus


Query: 1   agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg 60
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1   agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg 60

Query: 61  agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg 120
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 61  agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg 120

Query: 121 gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt 180
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 121 gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt 180

Query: 181 cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat 240
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 181 cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat 240

Query: 241 ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat 300
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 241 ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat 300

Query: 301 gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac 360
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 301 gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac 360

Query: 361 aactgcaatgagtggttccatggggactgcatccggatca 400
           ||||||||||||||||||||||||||||||||||||||||
Sbjct: 361 aactgcaatgagtggttccatggggactgcatccggatca 400





>ref|NM_014593.1| Homo sapiens CpG binding protein (CGBP), 
mRNA 


(file truncated here) 


>dbj|AK010337.1|AK010337 Mus musculus ES cells cDNA, RIKEN full-length
           enriched library, clone:2410002116, full insert sequence
          Length = 2538

 Score =  234 bits (118), Expect = 3e-59
 Identities = 166/182 (91%)
 Strand = Plus / Plus


Query: 219 gagcggagatatggagggagatggttcagacccagagcctccagatgccggggaggacag 278
           ||||||||||||||| |||||||| |||||||  || ||||| ||||||||||| |||||
Sbjct: 260 gagcggagatatggaaggagatggctcagacctggaacctccggatgccggggacgacag 319

Query: 279 caagtccgagaatggggagaatgcgcccatctactgcatctgccgcaaaccggacatcaa 338
           |||||| |||||||||||||| || ||||||||||||||||| |||||||||||||||||
Sbjct: 320 caagtctgagaatggggagaacgctcccatctactgcatctgtcgcaaaccggacatcaa 379

Query: 339 ctgcttcatgatcgggtgtgacaactgcaatgagtggttccatggggactgcatccggat 398
            ||||||||||| || |||||||||||||| |||||||||||||| ||||||||||||||
Sbjct: 380 ttgcttcatgattggatgtgacaactgcaacgagtggttccatggagactgcatccggat 439

Query: 399 ca 400
           ||
Sbjct: 440 ca 441


 Score = 44.1 bits (22), Expect = 0.066
 Identities = 25/26 (96%)
 Strand = Plus / Plus


Query: 118 gcggaagtagttgtgggcgcctttgc 143
           ||||||||||||| ||||||||||||
Sbjct: 147 gcggaagtagttgcgggcgcctttgc 172


>dbj|AK017941.1|AK017941 Mus musculus adult male thymus cDNA, RIKEN
           full-length enriched library, clone:5830420C16, full insert sequence
          Length = 1461

 Score =  210 bits (106), Expect = 5e-52
 Identities = 151/166 (90%)
 Strand = Plus / Plus


Query: 235  ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294
            |||||||| |||||||  || ||||| ||||||||||| ||||||||||| |||||||||
Sbjct: 1048 ggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatggg 1107

Query: 295  gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354
            ||||| || ||||||||||||||||| ||||||||||||||||| ||||||||||| || 
Sbjct: 1108 gagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattgga 1167

Query: 355  tgtgacaactgcaatgagtggttccatggggactgcatccggatca 400
            |||||||||||||| |||||||||||||| ||||||||||||||||
Sbjct: 1168 tgtgacaactgcaacgagtggttccatggagactgcatccggatca 1213


 Score = 44.1 bits (22), Expect = 0.066
 Identities = 25/26 (96%)
 Strand = Plus / Plus


Query: 118 gcggaagtagttgtgggcgcctttgc 143
           ||||||||||||| ||||||||||||
Sbjct: 235 gcggaagtagttgcgggcgcctttgc 260


>gb|AC009750.7|AC009750 Drosophila melanogaster, chromosome 2L, region 23F-24A,
           BAC clone


(file truncated here) 


>emb|AL353748.13|AL353748 Human DNA sequence from clone RP11-317B17 on
           chromosome 9, complete sequence [Homo sapiens]
          Length = 179155

 Score = 38.2 bits (19), Expect = 4.1
 Identities = 22/23 (95%)
 Strand = Plus / Plus


Query: 192   ggcgggggtcgtgagggagtgcg 214
             |||| ||||||||||||||||||
Sbjct: 48258 ggcgtgggtcgtgagggagtgcg 48280


  Database: nt
    Posted date:  May 30, 2001  3:54 AM
  Number of letters in database: -996,408,959
  Number of sequences in database:  868,831

Lambda     K      H
    1.37    0.711     1.31

Gapped
Lambda     K      H
    1.37    0.711     1.31

Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Hits to DB: 436021
Number of Sequences: 868831
Number of extensions: 436021
Number of successful extensions: 7536
Number of sequences better than 10.0: 19
length of query: 400
length of database: 3,298,558,333
effective HSP length: 20
effective length of query: 380
effective length of database: 3,281,181,713
effective search space: 1246849050940
effective search space used: 1246849050940
T: 0
A: 30
X1: 6 (11.9 bits)
X2: 15 (29.7 bits)
S1: 12 (24.3 bits)
S2: 19 (38.2 bits)


As you can see, the file consists of three parts: some header information at the beginning 
followed by a summary of the alignments, the alignments, and then some additional 
summary parameters and statistics at the end. 
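
The one-line summaries in the second part of the file can be picked apart with a single regular expression. The following is a sketch under my own assumptions: the subroutine name parse_summary_line is hypothetical, and the pattern assumes each summary line ends with a whole-number bit score followed by the E value. Note that this version of BLAST prints a value such as 1e-124 as e-124, which Perl won't treat as a number until a leading 1 is added.

```perl
use strict;
use warnings;

# Split one line of the hit summary into its fields.
# Returns a hash reference, or nothing if the line doesn't look like a hit.
sub parse_summary_line {
    my ($line) = @_;

    my ($id, $description, $bits, $expect)
        = $line =~ /^(\S+)\s+(.*?)\s+(\d+)\s+(\S+)\s*$/
        or return;

    # "e-124" means 1e-124; add the 1 so Perl can use it as a number
    $expect = "1$expect" if $expect =~ /^e/i;

    # The accession sits between the first and second | characters
    my $accession = (split /\|/, $id)[1];

    return {
        accession   => $accession,
        description => $description,
        bits        => $bits,
        expect      => $expect + 0,
    };
}

my $hit = parse_summary_line(
    'emb|AL136862.1|HSM801830 Homo sapiens mRNA; cDNA DKFZp434F174 (f...   450  e-124'
);
print "$hit->{accession}: $hit->{bits} bits, E = $hit->{expect}\n" if $hit;
```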


12.4 Parsing BLAST Output 

So why parse BLAST output? One reason is to see if your DNA has any new matches 
against the DNA stored in the constantly growing databases. You can write a program to 
automatically perform a daily BLAST search and compare its results with those of the 
previous day by parsing the summary list of hits and comparing it with the previous day's 
summary list. You can then have the program email you if something new has turned up. 
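
The comparison step of such a daily check can be very small. Here is a sketch, assuming you have already parsed each day's summary list into an array of hit identifiers; the subroutine name new_hits and the sample identifiers are hypothetical, and running the search and sending the email are left out.

```perl
use strict;
use warnings;

# Return the IDs in today's list that were not in the previous list.
# Both arguments are references to arrays of hit IDs.
sub new_hits {
    my ($previous, $current) = @_;
    my %seen = map { $_ => 1 } @$previous;
    return grep { !$seen{$_} } @$current;
}

my @yesterday = ('AB031069.1', 'NM_014593.1');
my @today     = ('AB031069.1', 'NM_014593.1', 'AF149758.1');

my @fresh = new_hits(\@yesterday, \@today);
print "New hits: @fresh\n" if @fresh;
```

Using a hash for the previous day's IDs keeps the lookup simple, and the order of today's list is preserved in the result.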


12.4.1 Extracting Annotation and Alignments 


Example 12-1 consists of a main program and two new subroutines. The 
subroutines, parse_blast and parse_blast_alignment, use regular expressions 
to extract the various bits of data from a scalar string. I chose this method because the 
data, although structured, does not clearly identify each line with its function. (See the 
discussion in Chapter 10 and Chapter 11.) 

Example 12-1. Extract annotation and alignments from BLAST output file 


#!/usr/bin/perl
# Extract annotation and alignments from BLAST output file

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# declare and initialize variables
my $beginning_annotation = '';
my $ending_annotation = '';
my %alignments = ( );
my $filename = 'blast.txt';

parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);

# Print the annotation, and then
#   print the DNA in new format just to check if we got it okay.
print $beginning_annotation;

foreach my $key (keys %alignments) {
    print "$key\nXXXXXXXXXXXX\n", $alignments{$key}, "\nXXXXXXXXXXX\n";
}

print $ending_annotation;

exit;

################################################################################
# Subroutines for Example 12-1
################################################################################

# parse_blast
#
# --parse beginning and ending annotation, and alignments,
#     from BLAST output file

sub parse_blast {

    my($beginning_annotation, $ending_annotation, $alignments, $filename) = @_;

    # $beginning_annotation--reference to scalar
    # $ending_annotation   --reference to scalar
    # $alignments          --reference to hash
    # $filename            --scalar

    # declare and initialize variables
    my $blast_output_file = '';
    my $alignment_section = '';

    # Get the BLAST program output into an array from a file
    $blast_output_file = join( '', get_file_data($filename));

    # Extract the beginning annotation, alignments, and ending annotation
    ($$beginning_annotation, $alignment_section, $$ending_annotation)
    = ($blast_output_file =~ /(.*^ALIGNMENTS\n)(.*)(^  Database:.*)/ms);

    # Populate %alignments hash
    # key = ID of hit
    # value = alignment section
    %$alignments = parse_blast_alignment($alignment_section);
}

# parse_blast_alignment
#
# --parse the alignments from a BLAST output file,
#       return hash with
#       key = ID
#       value = text of alignment

sub parse_blast_alignment {

    my($alignment_section) = @_;

    # declare and initialize variables
    my(%alignment_hash) = ( );

    # loop through the scalar containing the BLAST alignments,
    # extracting the ID and the alignment and storing in a hash
    #
    # The regular expression matches a line beginning with >,
    # and containing the ID between the first pair of | characters;
    # followed by any number of lines that don't begin with >

    while($alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm) {
        my($value) = $&;
        my($key) = (split(/\|/, $value)) [1];
        $alignment_hash{$key} = $value;
    }

    return %alignment_hash;
}


The main program does little more than call the parsing subroutine and print the results. 
The arguments, initialized as empty, are passed by reference (see Chapter 6). 


The subroutine parse_blast does the top-level parsing job of separating the three sections 
of a BLAST output file: the annotation at the beginning, the alignments in the middle, 
and the annotation at the end. It then calls the parse_blast_alignment subroutine to 
extract the individual alignments from that middle alignment section. The data is first 
read in from the named file with our old friend the get_file_data subroutine from 
Chapter 8. Use the join function to store the array of file data into a scalar string. 


The three sections of the BLAST output file are separated by the following statement: 


($$beginning_annotation, $alignment_section, $$ending_annotation)
= ($blast_output_file =~ /(.*^ALIGNMENTS\n)(.*)(^  Database:.*)/ms);


The pattern match contains three parenthesized expressions: 


(.*^ALIGNMENTS\n)

which is returned into $$beginning_annotation; 

(.*)

which is saved in $alignment_section; and: 

(^  Database:.*)

which is saved in $$ending_annotation. 


The use of $$ instead of $ at the beginning of two of these variables indicates that they 
are references to scalar variables. Recall that they were passed in as arguments to the 
subroutine, where they were preceded by a backslash, like so: 


parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);


You've seen references to variables before, starting in Chapter 6. Let's review them 
briefly. Within the parse_blast subroutine, those variables with only one $ are 
references to the scalar variables. They need an additional $ to represent the actual 
scalar variables. This is how references work; they need an additional special character 
to indicate what kind of variable they refer to. So to use a reference as a scalar variable 
you start it with $$, to use it as an array variable you start it with @$, and to use it as a 
hash variable you start it with %$. 
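
Here is a tiny, self-contained reminder of the same mechanics, separate from the BLAST code. The subroutine name fill_in and the values it stores are made up for illustration.

```perl
use strict;
use warnings;

# The subroutine receives references and changes the caller's variables
sub fill_in {
    my ($text_ref, $table_ref) = @_;

    $$text_ref  = 'some annotation';      # $$ reaches the scalar
    %$table_ref = (id => 'some value');   # %$ reaches the hash
}

my $text  = '';
my %table = ( );

# The backslash makes a reference to each variable
fill_in(\$text, \%table);

print "$text\n";
print "$table{id}\n";
```

After the call, both variables in the main program hold the values assigned inside the subroutine, which is exactly how parse_blast hands its results back.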


The regular expression in the previous code snippet matches everything up to the word 
ALIGNMENTS on a line by itself (.*^ALIGNMENTS\n); then everything for a while 
(.*); then a line that begins with two spaces and the word Database: followed by the 
rest of the file (^  Database:.*). These three expressions in parentheses correspond 
to the three desired parts of the BLAST output file: the beginning annotation, the 
alignment section, and the ending annotation. 
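
You can try the same pattern on a miniature report to watch it work. The report text below is invented and far shorter than real BLAST output, and the subroutine name split_blast_report is my own; only the regular expression comes from the example.

```perl
use strict;
use warnings;

# Separate a BLAST report (one scalar string) into its three parts
sub split_blast_report {
    my ($report) = @_;
    my ($beginning, $alignments, $ending)
        = $report =~ /(.*^ALIGNMENTS\n)(.*)(^  Database:.*)/ms;
    return ($beginning, $alignments, $ending);
}

# An invented, heavily shortened report
my $report = join '',
    "BLASTN 2.1.3\n",
    "Database: nt\n",
    "ALIGNMENTS\n",
    ">db|ID1|first hit\n",
    "Query: 1 acgt 4\n",
    "  Database: nt\n",
    "    Posted date: (omitted)\n";

my ($begin, $middle, $end) = split_blast_report($report);
print "--- beginning ---\n$begin";
print "--- alignments ---\n$middle";
print "--- ending ---\n$end";
```

The Database: line near the top has no leading spaces, so it can't be mistaken for the start of the ending annotation. The s modifier lets . match newlines, and the m modifier lets ^ match at the start of any line.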


The alignments saved in $alignment_section are separated out by the 
subroutine parse_blast_alignment. This subroutine has one important loop: 


while($alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm) {
    my($value) = $&;
    my($key) = (split(/\|/, $value)) [1];
    $alignment_hash{$key} = $value;
}


You're probably thinking that this regular expression looks downright evil. At first glance, 
regular expressions do sometimes seem incomprehensible, so let's take a closer look. 
There are a few new things to examine. 


The five lines comprise a while loop, which (due to the global /g modifier on the 
pattern match in the while loop) keeps matching the pattern as many times as it appears 
in the string. Each time the program cycles through the loop, the pattern match finds the 
value (the entire alignment), then determines the key. The key and values are saved in the 
hash %alignment_hash. 


Here's an example of one of the matches that's found by this while loop when parsing 
the BLAST output shown in Section 12.3: 

>emb|AL353748.13|AL353748 Human DNA sequence from clone RP11-317B17 on
           chromosome 9, complete sequence [Homo sapiens]
          Length = 179155

 Score = 38.2 bits (19), Expect = 4.1
 Identities = 22/23 (95%)
 Strand = Plus / Plus


Query: 192   ggcgggggtcgtgagggagtgcg 214
             |||| ||||||||||||||||||
Sbjct: 48258 ggcgtgggtcgtgagggagtgcg 48280

This text starts with a line beginning with a > character. In the complete BLAST output, 
sections like these follow one another. What you want to do is start matching from a line 
beginning with > and include all following adjacent lines that don't start with a > 
character. You also want to extract the identifier, which appears between the first and 
second vertical bar | characters on the first line (e.g., AL353748.13 in this alignment). 


Let's dissect the regular expression: 


$alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm

This pattern match, which appears in a while loop within the code, has the modifier m 
for multiline. The m modifier allows ^ to match any beginning-of-line inside the multiline 
string, and $ to match any end-of-line. 


The regular expression breaks down as follows. The first part is: 


^>.*\n

It looks for > at the beginning of the BLAST output line, followed by .*, which matches 
any quantity of anything (except newlines), up to the first newline. In other words, it 
matches the first line of the alignment. 


Here's the rest of the regular expression: 

(^(?!>).*\n)+


After the ^, which matches the beginning of the line, you'll see a negative lookahead 
assertion, (?!>), which ensures that a > doesn't follow. Next, the .* matches all 
non-newline characters, up to the final \n at the end of the line. All of that is wrapped in 
parentheses and followed by a +, so that you match all the available lines. 
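
To see the loop and the lookahead in isolation, here is a sketch that runs the same pattern over an invented alignment section. The subroutine name split_alignments and the two toy records are assumptions of mine; each record must have at least one line after its > line, and the last line must end in a newline, for the pattern to match.

```perl
use strict;
use warnings;

# Collect each alignment record: a > line plus the lines that follow it
sub split_alignments {
    my ($section) = @_;
    my @records = ( );

    while ($section =~ /^>.*\n(^(?!>).*\n)+/gm) {
        push @records, $&;     # $& holds the text of the whole match
    }
    return @records;
}

# Two invented records
my $section = join '',
    ">db|ID1|first hit\n",
    " Score = 10 bits\n",
    "\n",
    ">db|ID2|second hit\n",
    " Score = 8 bits\n";

my @records = split_alignments($section);
print scalar(@records), " records found\n";
print "---\n$_" for @records;
```

The blank line inside the first record is kept, because an empty line doesn't begin with > and so passes the negative lookahead.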


Now that you've matched the entire alignment, you want to extract the key and populate 
the hash with your key and value. Within the while loop, the alignment that you just 
matched is automatically set by Perl as the value of the special variable $& and saved in 
the variable $value. Now you need to extract your key from the alignment. It can be 
found on the first line of the alignment stored in $value, between the first and second | 
symbols. 


Extracting this identifying key is done using the split function, which 
breaks the string into an array. The call to split: 


split(/\|/, $value) 


splits $value into pieces delimited by | characters. That is, the | symbol is used to 
determine where one list element ends and the next one begins. (Remember that the 
vertical bar | is a metacharacter and must be escaped as \|.) By surrounding the call to 
split with parentheses and adding an array offset ([1]), you can isolate the key and save 
it into $key. 
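
As a quick check of that idiom on its own, here is a sketch with a hypothetical wrapper named alignment_key, applied to the first line of one of the alignments shown earlier.

```perl
use strict;
use warnings;

# Return the text between the first and second | characters
sub alignment_key {
    my ($value) = @_;
    my ($key) = (split(/\|/, $value))[1];   # element 1 is the second piece
    return $key;
}

my $value = ">emb|AL353748.13|AL353748 Human DNA sequence from clone\n";
print alignment_key($value), "\n";
```

Element 0 of the list is the text before the first | (here, >emb), which is why the offset is 1 rather than 0.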


Let's step back now and look at Example 12-1 in its entirety. Notice that it's very 
short—barely more than two pages, including comments. Although it's not an easy 
program, due to the complexity of the regular expressions involved, you can make sense 
of it if you put a little effort into examining the BLAST output files and the regular 
expressions that parse it. 


Regular expressions have lots of complex features, but as a result, they can do lots of 
useful things. As a Perl programmer, the effort you put into learning them is well worth it 
and can have significant payoffs down the road. 


12.4.2 Parsing BLAST Alignments 


Let's take the parsing of the BLAST output file a little further. Notice that some of the 
alignments include more than one aligned string—for instance, the alignment for ID 
AK017941.1, shown again here: 


>dbj|AK017941.1|AK017941 Mus musculus adult male thymus cDNA, RIKEN
           full-length enriched library, clone:5830420C16, full insert sequence
          Length = 1461

 Score =  210 bits (106), Expect = 5e-52
 Identities = 151/166 (90%)
 Strand = Plus / Plus


Query: 235  ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294
            |||||||| |||||||  || ||||| ||||||||||| ||||||||||| |||||||||
Sbjct: 1048 ggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatggg 1107

Query: 295  gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354
            ||||| || ||||||||||||||||| ||||||||||||||||| ||||||||||| || 
Sbjct: 1108 gagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattgga 1167

Query: 355  tgtgacaactgcaatgagtggttccatggggactgcatccggatca 400
            |||||||||||||| |||||||||||||| ||||||||||||||||
Sbjct: 1168 tgtgacaactgcaacgagtggttccatggagactgcatccggatca 1213


 Score = 44.1 bits (22), Expect = 0.066
 Identities = 25/26 (96%)
 Strand = Plus / Plus


Query: 118 gcggaagtagttgtgggcgcctttgc 143
           ||||||||||||| ||||||||||||
Sbjct: 235 gcggaagtagttgcgggcgcctttgc 260


To parse these alignments, we have to parse out each of the matched strings, which in
BLAST terminology are called high-scoring segment pairs (HSPs).


Each HSP also contains some annotation, and then the HSP itself. Let's parse each HSP 
into annotation, query string, and subject string, together with the starting and ending 
positions of the strings. More parsing is possible; you can extract specific features of the 
annotation, as well as the locations of identical and nonidentical bases in the HSP, for 
instance. 
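
To make the last suggestion concrete, here is a minimal sketch of how the nonidentical positions of an HSP might be located once the query and subject strings are in hand. The subroutine name mismatch_positions is my own invention for this illustration, and the sketch assumes an ungapped HSP in which both strings have the same length:

```perl
use strict;
use warnings;

# Return the 1-based offsets at which two equal-length aligned strings differ.
# (Hypothetical helper; assumes no gap characters in either string.)
sub mismatch_positions {
    my ($query, $subject) = @_;
    my @positions = ();
    for (my $i = 0; $i < length($query); ++$i) {
        if (substr($query, $i, 1) ne substr($subject, $i, 1)) {
            push(@positions, $i + 1);
        }
    }
    return @positions;
}

# The two strings of the short second HSP in the alignment above
my @diffs = mismatch_positions('gcggaagtagttgtgggcgcctttgc',
                               'gcggaagtagttgcgggcgcctttgc');

print "Mismatches at offsets: @diffs\n";
```

The offsets are relative to the start of the HSP; adding the starting position of the query or subject range converts them to sequence coordinates.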


Example 12-2 includes a pair of subroutines: one to parse the alignments into their
HSPs, and the second to extract the sequences and their end positions. The main program
extends Example 12-1 using these new subroutines.


Example 12-2. Parse alignments from BLAST output file 


#!/usr/bin/perl
# Parse alignments from BLAST output file

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# declare and initialize variables
my $beginning_annotation = '';
my $ending_annotation = '';
my %alignments = ( );
my $alignment = '';
my $filename = 'blast.txt';
my @HSPs = ( );
my($expect, $query, $query_range, $subject, $subject_range)
    = ('','','','','');

parse_blast(\$beginning_annotation, \$ending_annotation,
    \%alignments, $filename);

$alignment = $alignments{'AK017941.1'};

@HSPs = parse_blast_alignment_HSP($alignment);

($expect, $query, $query_range, $subject, $subject_range)
    = extract_HSP_information($HSPs[1]);

# Print the results
print "\n-> Expect value:   $expect\n";
print "\n-> Query string:   $query\n";
print "\n-> Query range:    $query_range\n";
print "\n-> Subject String: $subject\n";
print "\n-> Subject range:  $subject_range\n";

exit;

################################################################################
# Subroutines for Example 12-2
################################################################################

# parse_blast_alignment_HSP
#
# --parse beginning annotation, and HSPs,
#     from BLAST alignment
#   Return an array with first element set to the beginning annotation,
#     and each successive element set to an HSP

sub parse_blast_alignment_HSP {

    my($alignment ) = @_;

    # declare and initialize variables
    my $beginning_annotation = '';
    my $HSP_section = '';
    my @HSPs = ( );

    # Extract the beginning annotation and HSPs
    ($beginning_annotation, $HSP_section )
        = ($alignment =~ /(.*?)(^ Score =.*)/ms);

    # Store the $beginning_annotation as the first entry in @HSPs
    push(@HSPs, $beginning_annotation);

    # Parse the HSPs, store each HSP as an element in @HSPs
    while($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {
        push(@HSPs, $&);
    }

    # Return an array with first element = the beginning annotation,
    # and each successive element = an HSP
    return(@HSPs);
}

# extract_HSP_information
#
# --parse a HSP from a BLAST output alignment section
#   - return array with elements:
#     Expect value
#     Query string
#     Query range
#     Subject string
#     Subject range

sub extract_HSP_information {

    my($HSP) = @_;

    # declare and initialize variables
    my($expect) = '';
    my($query) = '';
    my($query_range) = '';
    my($subject) = '';
    my($subject_range) = '';

    ($expect) = ($HSP =~ /Expect = (\S+)/);

    $query = join ( '' , ($HSP =~ /^Query.*\n/gm) );

    $subject = join ( '' , ($HSP =~ /^Sbjct.*\n/gm) );

    $query_range = join('..', ($query =~ /(\d+).*\D(\d+)/s));

    $subject_range = join('..', ($subject =~ /(\d+).*\D(\d+)/s));

    $query =~ s/[^acgt]//g;

    $subject =~ s/[^acgt]//g;

    return ($expect, $query, $query_range, $subject, $subject_range);
}


Example 12-2 gives the following output:

-> Expect value:   5e-52

-> Query string:
ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg
gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg
tgtgacaactgcaatgagtggttccatggggactgcatccggatca

-> Query range:    235..400

-> Subject String:
ctggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatggg
ctgagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattgga
cttgtgacaactgcaacgagtggttccatggagactgcatccggatca

-> Subject range:  1048..1213
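
One detail of this output deserves a second look. The subject string begins with the letters ct, and ct turns up again at two places inside it, yet those letters are not in the Sbjct lines of the alignment. They appear to come from the label Sbjct: itself, whose lowercase c and t survive a substitution that deletes everything except a, c, g, and t. (The label Query: contains none of those four letters, so the query string is unaffected.) The following is a minimal sketch of one way around this, not the book's code; the two Sbjct lines are made-up sample data, and the sketch assumes each line has the layout label, start position, sequence, end position:

```perl
use strict;
use warnings;

# Two made-up Sbjct lines (label, start, sequence, end)
my $subject_lines = "Sbjct: 235 gcggaagtagttg 247\n"
                  . "Sbjct: 248 cgggcgcctttgc 260\n";

# Deleting every character that is not a, c, g, or t
# also keeps the "c" and "t" of each "Sbjct:" label.
my $naive = $subject_lines;
$naive =~ s/[^acgt]//g;

# Alternative: capture only the sequence field of each line.
# In list context, a /g match returns all the captured substrings.
my $clean = join('', ($subject_lines =~ /^Sbjct:\s+\d+\s+([a-z-]+)/gm));

print "naive: $naive\n";
print "clean: $clean\n";
```

The character class [a-z-] also accepts the hyphens BLAST uses for gaps; whether you want to keep them depends on what you plan to do with the string.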


Let's discuss the new features of Example 12-2 and its subroutines. First notice that
the two new subroutines from Example 12-1 have been placed into the
BeginPerlBioinfo.pm module, so they aren't printed again here.


The main program, Example 12-2, starts the same as Example 12-1; it calls the 
parse_blast subroutine to separate the annotation from the alignments in the BLAST 
output file. 


The next line fetches one of the alignments from the %alignments hash, which is then
used as the argument to the parse_blast_alignment_HSP subroutine. That subroutine
returns an array, stored in @HSPs, whose first element is the annotation and whose
remaining elements are the HSPs. Here you see that a subroutine can return an array as
well as a scalar value.


Finally, Example 12-2 does the lower-level parsing of an individual HSP by calling 
the extract_HSP_information subroutine, and the extracted parts of one of the HSPs are 
printed. 


Example 12-2 shows a certain inconsistency in our design. Some subroutines call their 
arguments by reference; others call them by value (see Chapter 6). You may ask: is 
this a bad thing? 


The answer is: not necessarily. The subroutine parse_blast mixes several arguments, and 
one of them is not a scalar type. Recall that this is a potentially good place to use call-by- 
reference in Perl. The other subroutines don't mix argument types this way. However, 
they can be designed to call their arguments by reference. 
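
As a reminder of the difference, here is a small sketch that puts the two calling styles side by side. Both subroutines, uppercase_all and count_bases, are hypothetical and exist only for this illustration:

```perl
use strict;
use warnings;

# Call by value: the subroutine works on copies and returns a new list.
sub uppercase_all {
    my (@strings) = @_;
    return map { uc } @strings;
}

# Call by reference: the arguments mix a hash, a scalar to be updated,
# and a plain string, so the first two are passed as references.
sub count_bases {
    my ($counts_ref, $total_ref, $dna) = @_;
    foreach my $base (split(//, $dna)) {
        $$counts_ref{$base}++;
        $$total_ref++;
    }
}

my %counts = ();
my $total  = 0;
count_bases(\%counts, \$total, 'aacgt');

my @upper = uppercase_all('acgt', 'ttaa');

print "total = $total, a = $counts{'a'}, upper = @upper\n";
```

The second subroutine changes the caller's own %counts and $total, which is what parse_blast does with its annotation and alignment arguments.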


Continuing with the code, let's examine the subroutine
parse_blast_alignment_HSP. This takes one of the alignments from the BLAST
output and separates out the individual HSP string matches. The technique used is, once
again, regular expressions operating on a single string that contains all the lines of the
alignment given as the input argument.


The first regular expression parses out the annotation and the section containing the HSPs:

($beginning_annotation, $HSP_section )
    = ($alignment =~ /(.*?)(^ Score =.*)/ms);

The first pair of parentheses in the regular expression is (.*?). This is the nongreedy or
minimal matching mentioned in Chapter 9. Here it gobbles up everything before the
first line that begins Score = (without the ? after the *, it would gobble everything until
the final line that begins Score =). This is the exact dividing line between the beginning
annotation and the HSP string matches.
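
The difference between the two kinds of matching is easy to see on a small string. The following sketch uses a made-up, much-shortened alignment (not real BLAST output) and applies the same pattern with and without the ?:

```perl
use strict;
use warnings;

# A made-up, much-shortened alignment containing two HSPs
my $alignment = "Length = 1461\n"
              . " Score = 210 bits\nQuery: 1 acgt 4\n"
              . " Score = 44.1 bits\nQuery: 9 ttaa 12\n";

# Nongreedy: the first group stops before the FIRST " Score =" line.
my ($annotation, $hsps) = ($alignment =~ /(.*?)(^ Score =.*)/ms);

# Greedy: the first group runs on to just before the LAST " Score =" line.
my ($greedy_annotation, $greedy_hsps) = ($alignment =~ /(.*)(^ Score =.*)/ms);

print "nongreedy annotation:\n$annotation\n";
print "greedy annotation:\n$greedy_annotation\n";
```

With the greedy version, the first HSP ends up in the annotation, which is not what the subroutine needs.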


The next loop and regular expression separate the individual HSP string matches:

while($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {
    push(@HSPs, $&);
}


This is the same kind of global string match in a while loop you've seen before; it keeps
iterating as long as the match can be found. The other modifier /m is the multiline
modifier, which enables the metacharacters ^ and $ to match after and before embedded
newlines.


The expression within the first pair of parentheses, (^ Score =.*\n), matches a
line that begins Score =, which is the kind of line that begins an HSP string match
section.


The code within the second pair of parentheses, (^(?! Score =).*\n)+, matches one
or more (the + following the parentheses) lines that do not begin with Score =. The ?!
at the beginning of the embedded parentheses is the negative lookahead assertion you
encountered in Example 12-1. So, in total, the regular expression captures a line
beginning with Score = and all succeeding adjacent lines that don't begin with Score =.
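
Here is the loop again as a small standalone sketch, run on a made-up HSP section that has been cut down to a few lines so you can follow what each iteration matches:

```perl
use strict;
use warnings;

# Made-up HSP section: two HSPs, each introduced by a " Score =" line
my $HSP_section = " Score = 210 bits (106), Expect = 5e-52\n"
                . "Query: 235 ggagatgg 242\n"
                . "Sbjct: 1048 ggagatgg 1055\n"
                . " Score = 44.1 bits (22), Expect = 0.066\n"
                . "Query: 118 gcggaagt 125\n"
                . "Sbjct: 235 gcggaagt 242\n";

my @HSPs = ();
while ($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {
    push(@HSPs, $&);    # $& holds the entire text of the last match
}

print scalar(@HSPs), " HSPs found\n";
foreach my $hsp (@HSPs) {
    my ($expect) = ($hsp =~ /Expect = (\S+)/);
    print "Expect = $expect\n";
}
```

Each pass through the loop picks up where the previous match ended, so the second iteration starts at the second Score = line.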


12.5 Presenting Data 


Up to now, we've relied on the print statement to format output. In this section, I 
introduce three additional Perl features for writing output: 


• printf function
• here documents
• format and write functions


The entire story about these Perl output features is beyond the scope of this book, but I'll 
tell you just enough to give you an idea of how they can be used. 


12.5.1 The printf Function 


The printf function is like the print function but with extra features that allow you to 
specify how certain data is printed out. Perl's printf function is taken from the C 
language function of the same name. Here's an example of a printf statement: 

my $first = '3.14159265';
my $second = 76;
my $third = "Hello world!";

printf STDOUT "A float: %6.4f An integer: %-5d and a string: %s\n",
    $first, $second, $third;


This code snippet prints the following:


A float: 3.1416 An integer: 76    and a string: Hello world!





The arguments to the printf function consist of a format string, followed by a list of
values that are printed as specified by the format string. The format string may also
contain any text along with the directives to print the list of values. (You may also
specify an optional filehandle in the same manner you would a print function.)


The directives consist of a percent sign followed by a required conversion specifier,
which in the example includes f for floating point, d for integer, and s for string. The
conversion specifier indicates what kind of data is in the variable to be printed. Between
the % and the conversion specifier, there may be 0 or more flags, an optional minimum
field width, an optional precision, and an optional length modifier. The list of values
following the format string must contain data that matches the types of directives, in
order.


There are many possible options for these flags and specifiers (some are listed in
Appendix B). Here's what the printf example above contains. First, the directive %6.4f
specifies to print a floating-point (that is, a decimal) number, with a minimum width of
six characters overall (padded with spaces if necessary), and four digits after the
decimal point. You see in the output that, although the $first floating-point number gives
the value of pi to eight decimal places, the example specifies a precision of four decimal
places, so the value is rounded to 3.1416.


The %-5d directive specifies an integer to be printed in a field of width 5; the - flag
causes the number to be left-justified in the field. Finally, the %s directive prints a string.
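
If you want to experiment with directives, the sprintf function is handy: it accepts the same format strings as printf but returns the formatted string instead of printing it. Here is a minimal sketch; the square brackets are only there to make the padding visible:

```perl
use strict;
use warnings;

my $first  = '3.14159265';
my $second = 76;
my $third  = "Hello world!";

# sprintf takes the same format string as printf but returns the result.
my $float  = sprintf("%6.4f",  $first);    # width 6, four decimal places
my $left   = sprintf("[%-5d]", $second);   # left-justified in five columns
my $right  = sprintf("[%5d]",  $second);   # right-justified (the default)
my $string = sprintf("%s",     $third);

print "$float\n$left\n$right\n$string\n";
```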


12.5.2 here Documents 


Now we'll briefly examine here documents. These are convenient ways to specify 
multiline text for output with perhaps some variables to be interpolated, in a way that 
looks pretty much the same in your code as it will in the output—that is, without a lot of 
print statements or embedded newline \n characters. We'll follow Example 12-3 
and its output with a discussion. 


Example 12-3. Example of here document 


#!/usr/bin/perl
# Example of here document

use strict;
use warnings;

my $DNA = 'AAACCCCCCGGGGGGGGTTTTTT';

for( my $i = 0 ; $i < 2 ; ++$i ) {
    print <<HEREDOC;
On iteration $i of the loop!
$DNA

HEREDOC
}

exit;


Here's the output from Example 12-3:

On iteration 0 of the loop!
AAACCCCCCGGGGGGGGTTTTTT

On iteration 1 of the loop!
AAACCCCCCGGGGGGGGTTTTTT


In Example 12-3, a here document was put in a for loop, so that you can see the $i
variable changing in the printout. The variables are interpolated into a here document in
the same way they are interpolated into a double-quoted string. Every time the program
goes through the loop, the contents of the here document are subject to variable
interpolation and are printed out. The terminating string used in this example, HEREDOC,
can be any string you specify. (There are several options for dealing with things like
indentation and so forth; I won't discuss them here and refer you to the Perl
documentation.) Here documents are handy for some tasks, such as when you have a long,
multiline document with just a few changes applied each time you print it. A business
form letter, with only the addressee changed, is a typical example. Using a here document
preserves the look of the final output in the code, while allowing variable interpolation.
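
The form-letter idea can be sketched in a few lines. The names and sample IDs below are invented for the illustration, and the here document is appended to a string rather than printed directly, just to show that it can be used anywhere a double-quoted string can:

```perl
use strict;
use warnings;

# Hypothetical recipients; only the name and sample ID change per letter.
my %samples = (
    'Dr. Chen'  => 'S-001',
    'Dr. Gomez' => 'S-002',
);

my $letters = '';
foreach my $name (sort keys %samples) {
    my $id = $samples{$name};
    # Everything up to the line END_LETTER acts like a double-quoted string.
    $letters .= <<END_LETTER;
Dear $name,

Sequencing of sample $id is complete.

END_LETTER
}

print $letters;
```

Note that the terminating string END_LETTER sits at the very start of its line, with no indentation, even though the statement that begins the here document is indented.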





12.5.3 format and write 


Finally, let's take a look at the format and write functions. format is designed to 
generate reports and can handle page numbers, headers, and various layout options such 
as centering and left and right justification. It's modelled on the FORTRAN 
programming-language conventions for formatting and so is particularly handy for 
producing reports based on that style, such as the PDB file format, in which fields are 
specified as occupying certain columns on the line. 

Example 12-4 is a short example of a format that creates a FASTA-style output. 


Example 12-4. Example of format function to produce FASTA output 


#!/usr/bin/perl
# Create fasta format DNA output with "format" function

use strict;
use warnings;

# Declare variables
my $id = 'A0000';
my $description = 'Highly weird DNA. This DNA is so unlikely!';
my $DNA = 'AAAAAACCCCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTTTTTTTTTTTTTT';

# Define the format
format STDOUT =
# The header line
>@<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<...
$id,        $description
# The DNA lines
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~
$DNA
.

# Print the fasta-formatted DNA output
write;

exit;

Here's the output of Example 12-4:

>A0000      Highly weird DNA. This DNA is so...
AAAAAACCCCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGTTTTTTTT
TTTTTTTTTTTTTT


After declaring and initializing the variables that fill in the form, the form is defined with:

format STDOUT =

and the format continues until it reaches a line consisting of a single period.

The format is composed of three kinds of lines:

• A comment beginning with the pound sign #
• A picture line that specifies the layout of text
• An argument line that names the variables that fill in the preceding picture line


The picture line and the argument line must be adjacent; they can't be separated by a
comment line, for instance.


The first picture line/argument line combo is for the header information:


>@<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<...
$id,        $description


The picture line has two picture fields in it, associated with the variables $id and
$description, respectively. The picture line begins with a greater-than sign, >, which
is just text that begins each FASTA file header line, by definition. Then comes the first
picture field, which is an @ sign followed by nine < signs. The @ sign declares a field that
has the associated variable interpolated into it. The use of the nine less-than signs
specifies that the value should be left-justified, for a total of 10 columns. If the value is
bigger than 10 columns, it is truncated. A less-than sign left-justifies, a greater-than sign
right-justifies, and a vertical bar | centers the data in the field.


The second picture field is almost identical. It is longer and ends with three dots (an
ellipsis), which print if the contents of the variable $description can't fit into the
length of the picture field (which, in this case, is true).


The next pair of picture/argument lines is:


^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~
$DNA


The picture field starts with a caret ^, which declares a picture field that will handle
variable-length records. The line also contains 49 less-than signs, for a total of 50
columns, left-justified. At the end are two tilde ~ signs, which indicate there should be
additional lines for the data if it doesn't fit on one line.


The write command simply prints the previously defined format. By default, the output 
goes to STDOUT, as is done in the example, but you can supply a filehandle to the 
format and write statements if you desire. 
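
If you'd like to try out picture lines without defining a whole format, Perl's built-in formline function applies a single picture line to a list of values and appends the result to the special variable $^A. The following sketch wraps that in a small helper; the name swrite follows a convention used in the Perl documentation, and the picture lines here are my own examples:

```perl
use strict;
use warnings;

# Format one picture line with the given values and return the result.
# formline appends to the accumulator $^A, so it is cleared first.
sub swrite {
    my ($picture, @values) = @_;
    $^A = '';
    formline($picture, @values);
    my $result = $^A;
    $^A = '';
    return $result;
}

# Left-justified 10-column ID followed by a right-justified 6-column length.
# Single quotes keep Perl from trying to interpolate the @ signs.
my $line = swrite('@<<<<<<<<< @>>>>>' . "\n", 'A0000', 1461);
print $line;

# A value longer than its field is truncated to the field width.
my $short = swrite('@<<<' . "\n", 'AK017941');
print $short;
```

This is only a convenience for experimenting; for real reports, format and write also give you page headers and page numbering.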


The upcoming release of Perl 6 will move formats out of the core of the
language and make them into a module. Details are not available as of
this writing, but this change will probably entail adding a statement
such as use Formats; near the top of your code in order to load the module for
using formats.


12.6 Bioperl 


The Bioperl project is an important collection of Perl code for bioinformatics that has 
been in development since 1998. Although Bioperl uses the more advanced object- 
oriented style of Perl program design, it's possible to take an introductory look here at 
how it's organized and used. 


The main focus of Bioperl modules is to perform sequence manipulation, provide access 
to various biology databases (both local and web-based), and parse the output of various 
programs. 


Bioperl is available at http://www.bioperl.org/. Some of its features rely on having
additional Perl modules, available from CPAN (http://www.cpan.org/), installed.
This situation is quite common, and as you do more Perl programming, you'll become
familiar with installing modules from CPAN. The Bioperl tutorials include information
on installing Bioperl and additional modules for the three major operating systems: Unix
or Linux, Mac, and Windows.


Bioperl doesn't provide complete programs. Rather, it provides a fairly large and
growing set of modules for accomplishing common tasks, including some tasks you've
seen in this book. You're responsible for writing the code that holds the modules together.
By providing these ready and (usually) easy-to-use modules, Bioperl makes developing
bioinformatics applications in Perl faster and easier. There are example programs for
most of the modules, which can be examined and modified to get started.


Like many open source projects, Bioperl has suffered from fragmentation and uneven 
documentation, due to the strictly volunteer and geographically dispersed group of 
contributors. But recent work on the project leading up to Release 0.7 in March 2001 has 
significantly improved the project. In particular, there is now enough tutorial information 
on using the modules to enable you to make good use of the code. 


Some difficulties still remain. Most of the code has been developed on Unix or Linux 
systems. Not all of it works on Macs or Windows operating systems, but most will. There 
are some documents available at the Bioperl web site that discuss using Bioperl on non- 
Unix computers, but the bottom line is that you might find that some things don't work. 


If you're going to give Bioperl a try (and I strongly recommend you do), you should make
sure you have a fairly recent version of Perl installed. You'll need at least Version 5.004;
it would be much better to install the latest stable release from the Perl web site,
http://www.perl.com.


12.6.1 Sample Modules


To give you an idea of what tasks Bioperl can make easier for you, Table 12-1 displays 
a representative sample of some of the most useful modules available. 


Table 12-1. Bioperl modules

Module                                Description

Bio::Seq                              Sequence object, with features
Bio::SimpleAlign                      Multiple alignments held as a set of sequences
Bio::Species                          Generic species object
Bio::DB::Ace                          Database object interface to ACeDB servers
Bio::DB::GDB                          Database object interface to GDB HTTP query
Bio::DB::GenBank                      Database object interface to GenBank
Bio::DB::GenPept                      Database object interface to GenPept
Bio::DB::NCBIHelper                   A collection of routines useful for queries to NCBI databases
Bio::DB::SwissProt                    Database object interface to SWISS-PROT retrieval
Bio::Index::Fasta                     Interface for indexing FASTA files
Bio::Index::GenBank                   Interface for indexing GenBank seq files, that is, flat files in GenBank format
Bio::Location::Simple                 Implementation of a simple location on a sequence
Bio::Location::Split                  Implementation of a location on a sequence that has multiple locations
Bio::SeqFeature::FeaturePair          Holds pair feature information, e.g., BLAST hits
Bio::SeqFeature::Generic              Generic SeqFeature
Bio::SeqFeature::Similarity           Sequence feature based on similarity
Bio::SeqFeature::SimilarityPair       Sequence feature based on the similarity of two sequences
Bio::SeqFeature::Gene::Exon           Feature representing an exon
Bio::SeqFeature::Gene::GeneStructure  Feature representing an arbitrarily complex structure of a gene
Bio::SeqFeature::Gene::Transcript     Feature representing a transcript
Bio::SeqFeature::Gene::TranscriptI    Interface for a feature representing a transcript of exons, promoter, UTR, and a poly-adenylation site
Bio::Tools::Blast                     Bioperl BLAST sequence analysis object
Bio::Tools::BPbl2seq                  Lightweight BLAST parser for pair-wise sequence alignment using the BLAST algorithm
Bio::Tools::BPlite                    Lightweight BLAST parser
Bio::Tools::BPpsilite                 Lightweight BLAST parser for PSIBLAST reports
Bio::Tools::CodonTable                Bioperl codon table object
Bio::Tools::Fasta                     Bioperl FASTA utility object
Bio::Tools::IUPAC                     Generates unique seq objects from an ambiguous seq object
Bio::Tools::RestrictionEnzyme         Bioperl object for a restriction endonuclease object
Bio::Tools::SeqPattern                Bioperl object for a sequence pattern or motif
Bio::Tools::SeqStats                  Object holding statistics for one particular sequence
Bio::Tools::SeqWords                  Object holding n-mer statistics for one sequence
Bio::Tools::Blast::HSP                Bioperl BLAST high-scoring segment pair object
Bio::Tools::Blast::HTML               Bioperl utility module for HTML-formatting BLAST reports
Bio::Tools::Blast::Sbjct              Bioperl BLAST "hit" object
Bio::Tools::Blast::Run::LocalBlast    Bioperl module for running BLAST analyses locally
Bio::Tools::Blast::Run::Webblast      Bioperl module for running BLAST analyses using an HTTP interface
Bio::Tools::Prediction::Exon          Predicted exon feature
Bio::Tools::Prediction::Gene          Predicted gene structure feature
Bio::Variation::AAChange              Sequence change class for polypeptides
Bio::Variation::AAReverseMutate       Point mutation and codon information from single amino acid changes
Bio::Variation::Allele                Sequence object with allele-specific attributes
Bio::Variation::DNAMutation           DNA-level mutation class
Bio::Variation::IO                    Handler for sequence variation I/O formats





12.6.2 Bioperl Tutorial Script 


Bioperl has a tutorial script to help you try out various parts of the package. In this 
section, I'll show how to start up and run some example computations. 


I've mentioned already that you should learn how to download code from CPAN in order
to add modules such as Bioperl. A great deal of the usefulness of the Perl programming
environment now resides in these modules available on CPAN. This was a design
decision: by concentrating on the core Perl language, the Perl designers can focus on
making the language as good as they can. The Perl module developers can then
concentrate on their many modules. By all means, take a look around the CPAN web site
for an idea of the wealth of Perl modules available to you.


I won't give the details of how to install Bioperl here: as mentioned, they are available at
the Bioperl web site, or you can visit the CPAN web site for information.


So, let's assume you've installed the Bioperl module and looked over the tutorial at the
Bioperl web site. Now, let's see how to try out some Bioperl programs.


Go to the directory where the Bioperl software has been built on your system. For
instance, on my Linux computer, I put the download file bioperl-0.7.0.tar.gz into the
directory /usr/local/src, and then unpacked it with the command:

tar xvzf bioperl-0.7.0.tar.gz


which creates the source directory /usr/local/src/bioperl-0.7.0. After installing the
module (check the documentation), you're ready to run the tutorial script.


Change to the source directory and type perl bptutorial.pl. Here's the result (I've 
shown the head of the tutorial to give the author and copyright information): 


% head bptutorial.pl
# $Id: ch12,v 1.44 2001/10/10 20:37:42 troutman Exp mam $

=head1 BioPerl Tutorial

  Cared for by Peter Schattner <schattner@alum.mit.edu>

  Copyright Peter Schattner

   This tutorial includes "snippets" of code and text from various
   Bioperl documents including module documentation, example scripts

% perl bptutorial.pl

The following numeric arguments can be passed to run the
corresponding demo-script.
1 => access_remote_db ,
2 => index_local_db ,
3 => fetch_local_db ,     (# NOTE: needs to be run with demo 2)
4 => sequence_manipulations ,
5 => seqstats_and_seqwords ,
6 => restriction_and_sigcleave ,
7 => other_seq_utilities ,
8 => run_standaloneblast ,
9 => blast_parser ,
10 => bplite_parsing ,
11 => hmmer_parsing ,
12 => run_clustalw_tcoffee ,
13 => run_psw_bl2seq ,
14 => simplealign_univaln ,
15 => gene_prediction_parsing ,
16 => sequence_annotation ,
17 => largeseqs ,
18 => liveseqs ,
19 => demo_variations ,
20 => demo_xml ,

In addition the argument "100" followed by the name of a single
bioperl object will display a list of all the public methods
available from that object and from what object they are inherited.

Using the parameter "0" will run all tests.
Using any other argument (or no argument) will run this display.

So typical command lines might be:
To run all demo scripts:
 > perl -w bptutorial.pl 0
or to just run the local indexing demos:
 > perl -w bptutorial.pl 2 3
or to list all the methods available for object Bio::Tools::SeqStats -
 > perl -w bptutorial.pl 100 Bio::Tools::SeqStats


Now let's try option 9, the BLAST parser, and option 1, access remote db. So here 
goes, starting with the BLAST parser: 


% perl bptutorial.pl 9

Beginning blast.pm parser example...

QUERY NAME     : gi|1401126
QUERY DESC     : UNKNOWN
LENGTH         : 504
FILE           : t/blast.report
DATE           : Thu, 16 Apr 1998 18:56:18 -0400
PROGRAM        : TBLASTN
VERSION        : 2.0.4 [Feb-24-1998]
DB-NAME        : Non-redundant GenBank+EMBL+DDBJ+PDB sequences
DB-RELEASE     : Apr 16, 1998 9:38 AM
DB-LETTERS     : 677679054
DB-SEQUENCES   : 336123
GAPPED         : YES
TOTAL HITS     : 100
CHECKED ALL    : YES
FILT FUNC      : NO
SIGNIF HITS    : 4
SIGNIF CUTOFF  : 1.0e-05 (EXPECT-VALUE)
LOWEST EXPECT  : 0.0
HIGHEST EXPECT : 1e-05
HIGHEST EXPECT : 7.6 (OVERALL)
MATRIX         : BLOSUM62
FILTER         : NONE
EXPECT         : 10
LAMBDA, K, H   : 0.270, 0.0470, 0.230 (SHARED STATS)
WORD SIZE      : 13
S              : 42, 74 (SHARED STATS)
GAP CREATION   : 11
GAP EXTENSION  : 1

Number of hits is 4
Fraction identical for hit 1 is 0.25
Sequence identities for hsp of hit 1 are 66-68 70 73 76 79 80 87-89 114 117
119 131 144 146 149 150 152 156 162 165 168 170 171 176 178-182 184 187 190
191 205-207 211 214 217 222 226 241 244 245 249 256 266-268 270 278 284 291
296 304 306 309 316 319 324





This is an interesting way to parse BLAST output! Now let's look at the access of the 
remote DB: 


% perl bptutorial.pl 1

Beginning remote database access example...
seq1 display id is MUSIGHBA1
seq2 display id is AF303112
Display id of first sequence in stream is AF041456

Well, that was less informative as an output, but it seems you can infer that the remote 
DB access was successful. (By the way, if you're unsuccessful with this, it may be that 
you're behind a firewall which is denying access—a not uncommon occurrence in 
universities or large companies.) 


The documentation suggests running the bptutorial.pl script under the Perl debugger to
watch what happens step by step. I concur with that suggestion but won't include the
output here. Try it yourself!


Since that last example wasn't much fun, let's try one more: here's the sequence 
manipulation tutorial: 


% perl bptutorial.pl 4

Beginning sequence manipulations and SeqIO example...
First sequence in fasta format...
>Test1
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
ACAACATCCATGAAACGCATTAGCACCACC
Seq object display id is Test1
Sequence is AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
Sequence from 5 to 10 is TTTCAT
Acc num is unknown
Moltype is dna
Primary id is Test1
Truncated Seq object sequence is TTTCAT
Reverse complemented sequence 5 to 10 is GTGCTA
Translated sequence 6 to 15 is LQRAICLCVD

Beginning 3-frame and alternate codon translation example...
ctgagaaaataa translated using method defaults   : LRK*
ctgagaaaataa translated as a coding region (CDS): MRK

Translating in all six frames:
 frame: 0 forward: LRK*
 frame: 0 reverse-complement: LFSQ
 frame: 1 forward: *ENX
 frame: 1 reverse-complement: YFLX
 frame: 2 forward: EKI
 frame: 2 reverse-complement: IFS

Translating with all codon tables using method defaults:
1 : LRK*
2 : L*K*
3 : TRK*
4 : LRK*
5 : LSK*
6 : LRKQ
9 : LSN*
10 : LRK*
11 : LRK*
12 : SRK*
13 : LGK*
14 : LSNY
15 : LRK*
16 : LRK*
21 : LSN*

That was more fun, because this part of Bioperl is doing several things we've done in this 
book. 
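
To connect this output with the earlier chapters, here is a minimal sketch in plain Perl (no Bioperl involved) of two of the steps behind the six-frame listing: reverse complementing the test sequence and cutting it into codons for each reading frame. The subroutine names are my own, and the translation of each codon into an amino acid is left out:

```perl
use strict;
use warnings;

# Reverse complement of a DNA string
sub revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Split a sequence into codons for reading frame 0, 1, or 2;
# a trailing partial codon is kept as the last element.
sub codons_in_frame {
    my ($dna, $frame) = @_;
    return (substr($dna, $frame) =~ /.{1,3}/g);
}

my $dna = 'ctgagaaaataa';    # the test sequence in the tutorial output
my $rc  = revcom($dna);

print "reverse complement: $rc\n";
for my $frame (0 .. 2) {
    print "frame $frame forward: ", join(' ', codons_in_frame($dna, $frame)), "\n";
    print "frame $frame reverse: ", join(' ', codons_in_frame($rc,  $frame)), "\n";
}
```

Looking up each complete codon in a genetic code table would give the peptides shown in the tutorial output; the partial codons at the ends correspond to the X characters in frames 1.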


I hope this brief look at Bioperl has whetted your appetite for more. It's a good idea to 
explore this set of modules. A Perl module for parsing BLAST output called BPLite.pm 
may also be of interest: it's now part of the Bioperl project. 


12.7 Exercises 


Exercise 12.1 


Basic string matching. Write a program that looks for a query string in a 
target string. For instance, if the query string is "gone", it finds a match at position 
22 of the target string "goof through the way-gone-osphere." Don't use regular 
expressions or any of Perl's built-in string-matching abilities; instead, examine 
individual positions in the strings, compare characters, and invent your own 
algorithm. 


Exercise 12.2 


Explore the NCBI BLAST web pages at http://www.ncbi.nlm.nih.gov/BLAST.
Familiarize yourself with the purpose and use of the various component programs
and read the tutorial information on the meaning of the statistics.


Exercise 12.3 


Explore the Bioperl web pages at http://www.bioperl.org. Download the 
code and install it on your computer. 


Exercise 12.4 


Perform BLAST searches at the NCBI web site. Search with DNA against DNA 
databases; then search with the same DNA against protein databases, and compare 
the output. 


Exercise 12.5 


Perform two BLAST searches with related sequences. Parse the BLAST output of 
the searches and extract the top 10 hits in the header annotation of each search. 
Write a program that reports on the differences and similarities between the two 
searches. 


Exercise 12.6 


Write a program that uses Bioperl to perform a BLAST search at the NCBI web 
site, then use Bioperl to parse the BLAST output. 


Exercise 12.7 


Using Bioperl modules mixed with your own code, write a program that runs 
BLAST on a set of DNA sequences and saves the IDs of the list of hits of each 
BLAST run sorted in arrays. Allow the user to view each list, to view hits in 
common between multiple lists and hits unique to one of multiple lists. For each 
hit, enable the user to fetch its entire GenBank record. 


Exercise 12.8 


Write an explanation of the code for the subroutine 
extract_HSP_information. Be sure to refer to the format of the data the code 
uses as input. 


Chapter 13. Further Topics 


This book's goal has been to help you learn basic Perl programming. In this chapter, 
I will point the way to further learning in Perl. 


13.1 The Art of Program Design 


My emphasis on the art of program design has determined the way in which the programs were presented. 
They've generally progressed from a discussion of problems and ideas, to pseudocode, to small groups of 
small, cooperating subroutines, and finally to a close-up discussion of the code. At several points you've 
seen more than one way to do the same task. This is an important part of a programmer's mindset: the 
knowledge of, and willingness to try, alternatives. 


The other recurrent theme has been to explain the problem-solving strategies programmers rely on. These 
include knowing how to use such sources of information as searchable newsgroup archives, books, and 
language documentation; having a good working knowledge of debugging tools; and understanding basic 
algorithm and data structure design and analysis. 


As your skills improve, and your programs become more complex, you'll find that 
these strategies take on a much more important role. Designing and coding 
programs to solve complex problems or crunch lots of complex data requires 
advanced problem-solving strategies. So it's worth your while to learn to think like a 
computer scientist as well as a biologist. 


13.2 Web Programming 


The Internet is the most important source of bioinformatics data. From FTP sites to 
web-enabled programs, the Perl-literate bioinformatician needs to be able to access 
web resources. Just about every lab has to have its own web page these days, and 
many grants even require it. You'll need to learn the basics about the HTML and XML 
markup languages that display web pages, about the difference between a web 
server and a web browser, and similar facts of life. 


The popular CGI.pm module makes it fairly easy to create interactive web pages, and 
several other modules are available that make Internet programming tasks relatively 
painless. For instance, you can write code for your own web page that enables visitors to 
try out your latest sequence analyzer or search through your special-purpose database. 
You can also add code to your own programs to enable them to interact with other web 
sites, querying and retrieving data automatically. Collaborators who are geographically 
diverse can use such web programming to work cooperatively on a project. 


13.3 Algorithms and Sequence Alignment 


You will want to spend some time exploring the standard results in algorithms, as found 
in the texts recommended in Appendix A. A good place to start is the basic sequence 
alignment methods such as the Smith-Waterman algorithm. In terms of algorithms, the 
topics of parallelization, randomization, and approximation deserve at least a nodding 
acquaintance. 


Sequence alignment is the subset of the family of algorithms called string matching 
algorithms that are used to find the extent of identity or similarity, or to find 
evidence of homology, between sequences. The Smith-Waterman algorithm, the 
treatment of gaps, the use of preprocessing, parallel techniques, the alignment of 
multiple sequences, and more are facets of this study. 
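

To make the idea of local alignment concrete, here is a minimal sketch of the scoring step of the Smith-Waterman algorithm in Perl. It computes only the best local alignment score, not the alignment itself, and it uses a simple linear gap penalty. The subroutine name smith_waterman_score and the scoring values (match +2, mismatch -1, gap -1) are assumptions chosen for illustration; real alignment programs use substitution matrices and more elaborate gap penalties.

```perl
use strict;
use warnings;

# Smith-Waterman local alignment score with a linear gap penalty.
# The default scoring values are illustrative assumptions only.
sub smith_waterman_score {
    my ($seq1, $seq2, $match, $mismatch, $gap) = @_;
    $match    = 2  unless defined $match;
    $mismatch = -1 unless defined $mismatch;
    $gap      = -1 unless defined $gap;

    my @a = split //, $seq1;
    my @b = split //, $seq2;

    # @prev holds the previous row of the scoring matrix; row 0 is all zeros.
    my @prev = (0) x (@b + 1);
    my $best = 0;

    for my $i (1 .. @a) {
        my @curr = (0);    # column 0 is always zero
        for my $j (1 .. @b) {
            my $diag = $prev[$j - 1]
                     + ($a[$i - 1] eq $b[$j - 1] ? $match : $mismatch);
            my $up   = $prev[$j] + $gap;
            my $left = $curr[$j - 1] + $gap;

            # A local alignment may restart anywhere,
            # so a cell's score never drops below zero.
            my $score = 0;
            for my $candidate ($diag, $up, $left) {
                $score = $candidate if $candidate > $score;
            }
            $curr[$j] = $score;
            $best = $score if $score > $best;
        }
        @prev = @curr;
    }
    return $best;
}

print smith_waterman_score('ACACACTA', 'AGCACACA'), "\n";
```

Keeping only two rows of the matrix is enough when you want just the score. To recover the aligned residues, you would store the full matrix and trace back from the highest-scoring cell.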


13.4 Object-Oriented Programming 


Object-oriented programming is a style of program design that provides a well- 
defined interface to data and subroutines (called methods in "OO-speak"). It's not 
hard to learn; it makes some things easy that would otherwise be hard (and vice 
versa, but you don't have to use it for everything!). A great deal of Perl code has 
been written in object-oriented style since the capability was added to the language 
a few years ago. 
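

As a small illustration of the style, the following sketch defines a class with a constructor and a few methods, then uses it from the main program. The class name DNASequence and its methods are hypothetical, invented for this example; they are not taken from Bioperl or any other module.

```perl
use strict;
use warnings;

# A hypothetical class for illustration; not part of Bioperl.
package DNASequence;

# Constructor: store the data in a hash and bless it into the class.
sub new {
    my ($class, %args) = @_;
    my $self = { id => $args{id}, seq => uc $args{seq} };
    return bless $self, $class;
}

# Accessor methods: the only way outside code should reach the data.
sub id  { my ($self) = @_; return $self->{id}; }
sub seq { my ($self) = @_; return $self->{seq}; }

sub seq_length {
    my ($self) = @_;
    return length $self->{seq};
}

# Reverse complement of the stored sequence.
sub revcom {
    my ($self) = @_;
    my $rc = reverse $self->{seq};
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

package main;

my $dna = DNASequence->new(id => 'Test1', seq => 'acgtta');
print $dna->id, ' ', $dna->seq_length, ' ', $dna->revcom, "\n";
```

The calling code never touches the underlying hash directly. That is the "well-defined interface": the class could change how it stores the sequence without breaking programs that use it.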


13.5 Perl Modules 


I've frequently mentioned modules, and CPAN (the large collection of Perl code) has 
a huge number of modules you can use. Most are free, but do make a point of 
checking for copyright restrictions and see the discussion in the Perl FAQs about 
copyright issues. These days, most modules, including the bulk of the code available 
on CPAN, are written in an object-oriented style. You'll need to extend your Perl 
knowledge to encompass this style, but you won't need an in-depth view of object- 
oriented techniques to use most modules in your programs. 


13.5.1 Bioperl 


An important and steadily developing suite of Perl modules for bioinformatics is the 
Bioperl project, which you can find at the web site http://www.bioperl.org. These 
modules give you lots of capabilities, all ready to use. 


13.6 Complex Data Structures 


Perl can handle complex data structures. This is useful in many programming 
situations; it's also necessary to learn in order to read a lot of existing Perl code that 
might come your way. 


For example, in this book, you've parsed a lot of data. To do so, you developed 
groups of subroutines, each fairly short, and each parsing different levels of the 
structure of the data. By using complex data structures, you can store your parse in 
a form that reflects the structure of the data. This, combined with object-oriented 
methods for accessing the parsed data, is a useful way to accomplish a parse. 


Complex data structures depend on references, which I've touched on in discussions of 
call by reference and of File::Find. 
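

The following sketch shows one way a parsed record might be stored so that the data structure mirrors the structure of the data: a hash whose "features" entry is a reference to an array of hashes. The field names and sample values are made up for illustration and don't correspond to any particular file format.

```perl
use strict;
use warnings;

# A hash containing a reference to an array of hash references.
# Field names and values are invented for illustration.
my %record = (
    locus    => 'EXAMPLE1',
    sequence => 'atggcgtaa',
    features => [
        { type => 'source', start => 1, end => 9 },
        { type => 'CDS',    start => 1, end => 9 },
    ],
);

# Return the types of all features in a record (passed by reference).
sub feature_types {
    my ($rec) = @_;
    return map { $_->{type} } @{ $rec->{features} };
}

# Add one more feature to the array stored inside the record.
push @{ $record{features} }, { type => 'gene', start => 1, end => 9 };

print join(', ', feature_types(\%record)), "\n";
print "Second feature ends at $record{features}[1]{end}\n";
```

The expression $record{features}[1]{end} reads from the outside in: the record's features, the second feature, that feature's end position.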


13.7 Relational Databases 


Relational databases are another area Perl programmers and bioinformaticians need 
to explore. There comes a time when flat files or DBM just won't do for managing the 
data of a medium- or large-sized project, and you must turn to relational databases. 
Although they take a bit more effort to set up and program, they offer a standard 
and reliable way to store data and ask questions about it. In this book, we briefly 
discussed relational databases and actually used a simple DBM database. In the 
course of your work, however, you're likely to encounter Oracle, MySQL, PostgreSQL, 
Sybase, and others. The Perl module DBI (the Database Interface, a database-independent 
interface for Perl) makes it possible to write code for manipulating relational databases 
that doesn't depend (too much) on which database you're actually using. 


The fact is, writing code to handle databases isn't hard to do. The hardest part is 
making sure that the database is installed with the proper libraries, that the proper 
Perl modules are in place, and that you know how to connect to the database from 
your program. Once you have those things in place, using the database is generally 
easy. 


That said, relational databases have their own lore, and there is a substantial body of 
knowledge about designing and managing good databases. Many programmers 
specialize in these issues, and that's true for plenty of bioinformaticians as well, 
since there are many interesting research questions related to designing better 
biological databases. 


13.8 Microarrays and XML 


Microarrays (miniaturized chip-based "laboratories" for studying gene expression) 
and XML (Extensible Markup Language) are two modern developments that are 
coming together. Now that whole genomes are available, microarray techniques 
enable you to measure the relative levels of thousands of gene transcripts at a time, 
and with their help, we hope to unravel the many pathways and interactions between 
the thousands of genes and gene products in the cell. XML is, to be painfully brief, a 
kind of new and improved HTML that is emerging as a standard for storing and 
interchanging data. (This book was written making extensive use of XML.) XML is 
becoming an important interface to many new kinds of experimental data. 


13.9 Graphics Programming 


Good graphical representation of data is critical for making your results useful to 
your colleagues. Graphics programming lets you present data and results, and 
interact with software applications, through attractive and easy-to-navigate interfaces. 
Many bioinformatics programs deal with large amounts of data, and a graphical user 
interface (GUI) can mean the difference between an application that helps you do 
your work and one that wastes your time. GUIs such as those commonly found on 
web pages are important not only for the display of output but also for the collection 
of user input. 


The point-and-click method of interacting with software applications is a basic 
standard. A good GUI makes an application or program much easier to use. One 
difficulty of GUIs and graphic data displays, however, is that they tend to be less 
portable than programs with simpler graphics. You may want to explore the graphics 
capabilities of such Perl modules as Tk and GD, among others. 


13.10 Modeling Networks 


Networks of interacting biological systems, such as genes and gene products, can be 
modeled and investigated using graph algorithms. Despite the similarity to the term 
"graphics," graph algorithms are a different entity based on the discrete 
mathematical field of graph theory. Algorithms on graphs and their many variants 
(such as Petri nets) can store and investigate the properties of biochemical pathways 
and intra- and intercellular signalling pathways, for example. 
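

As a minimal sketch of the idea, a pathway can be stored as a hash of arrays (an adjacency list) and explored with a breadth-first search. The pathway below and the subroutine name shortest_path are invented for illustration and don't describe a real signalling cascade.

```perl
use strict;
use warnings;

# A small directed graph stored as a hash of arrays (adjacency list).
# The node names are invented for illustration.
my %pathway = (
    receptor             => ['kinaseA'],
    kinaseA              => ['kinaseB', 'phosphatase'],
    kinaseB              => ['transcription_factor'],
    phosphatase          => [],
    transcription_factor => ['target_gene'],
    target_gene          => [],
);

# Breadth-first search: return the path with the fewest steps from
# $start to $goal as a list of node names, or an empty list if none.
sub shortest_path {
    my ($graph, $start, $goal) = @_;
    my %came_from = ($start => undef);    # also marks visited nodes
    my @queue = ($start);

    while (@queue) {
        my $node = shift @queue;
        if ($node eq $goal) {
            # Walk backward from the goal to rebuild the path.
            my @path;
            for (my $n = $goal; defined $n; $n = $came_from{$n}) {
                unshift @path, $n;
            }
            return @path;
        }
        for my $next (@{ $graph->{$node} || [] }) {
            next if exists $came_from{$next};
            $came_from{$next} = $node;
            push @queue, $next;
        }
    }
    return;
}

print join(' -> ', shortest_path(\%pathway, 'receptor', 'target_gene')), "\n";
```

Questions such as "can this receptor influence that gene, and through which intermediates?" become path queries on the graph.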


13.11 DNA Computers 


For the forward-thinking scientist, it is interesting and instructive to learn about new 
trends in computing such as DNA computers, optical computing, and quantum 
computing. DNA computers are especially fun. They use standard molecular biology 
laboratory techniques as a model of a general-purpose computer. They can 
implement algorithms, store data, and in general behave like a "real" computer. 
They are impractical as of this writing, but they are really fun to think about, and 
someday, who knows? 


Appendix A. Resources 


There is a wide array of resource material for Perl and for bioinformatics 
programming. This list is not at all exhaustive, but it includes those resources, both 
online and in print, that I think you may find interesting and useful as you expand 
your Perl programming repertoire. 


A.1 Perl 


The documentation for Perl is extensive. It includes lists of FAQs (Frequently Asked 
Questions, with answers), tutorials, precise definitions in the form of Unix-style 
manpages, and discussions of specific areas. There are various web sites, a well- 
organized storehouse of useful Perl programs called CPAN, newsgroups that have 
searchable archives, conferences, and many good books. It's also worth your while 
to find and cultivate your own local Perl community. Don't be afraid to engage your 
colleagues, though as your programming skills grow, they're liable to start asking 
you questions! 


As I've mentioned before, Perl is free. It's part of the wider open source movement, 
which includes such developments as Linux, the Apache web server, and so on. Since 
Perl is free, it relies on a community of interested parties to develop code and to 
write documentation. Because of this, you may notice that a lot of the 
documentation is a bit fragmented (or, in some cases, very fragmented). Still, the 
level of support for all these projects equals that available for the best of the 
commercial software packages. 


A.1.1 Web Site 


http://www.perl.com 


This is the starting point for all things Perl. By all means, explore it. From here, 
you'll find many more sites dedicated to various aspects of Perl programming. 
Among several, you might find http://www.perl.org especially useful. 





A.1.2 CPAN: Comprehensive Perl Archive Network 


http://www.cpan.org/ 





The Comprehensive Perl Archive Network is an important resource and is the place to look for Perl 
modules. It's also a repository for other software, documentation, and web links. Before taking the time to 
write a program yourself, look here first to see if it has already been written. 


A.1.3 FAQs: Frequently Asked Questions 


http://www.perl.com/pub/v/faqs 


FAQs are a compendium of the most common questions newcomers ask, 
along with answers, that are usually quite helpful. As a beginning 
programmer, it is a good idea to take the time to read the Perl FAQs— 
skimming as necessary—in order to get the lay of the land. 
You should spend at least enough time reading them to get an idea of what sorts of 
questions are archived in the FAQs. Be sure to check the FAQs before asking for help 
from a local expert or posting to a newsgroup. Repeatedly asking questions that 
have already been exhaustively answered in the FAQs, especially on the Perl 
newsgroups, might be considered irritating. 


You'll find that the Perl FAQs are divided into several parts. When consulting FAQs, 
look for the date when they were last updated; this isn't a big problem with Perl, but 
in general, you can find lots of out-of-date information on the Web. 


A.1.3.1 Beginners 


There are several documents aimed at beginners in the FAQs and in the documentation. 
There are some other beginning books besides this one, mentioned elsewhere in this 
appendix. There are also some online tutorials and beginners' articles about Perl at 
http://learn.perl.org (this is new as I write but looks very promising). There are also a 
number of mailing lists you can subscribe to, including a mailing list called 
beginners@perl.org, which you can subscribe to by visiting http://lists.perl.org. 











A.1.4 Online Manuals 


http://www.perl.com/pub/v/documentation 


The Perl manual is available online at the Perl web site mentioned earlier. It 
should also be installed on your computer. You can access it by typing perldoc 
perl. On Unix/Linux systems, you can also type man perl to get the beginning 
manpage. As that explains, the manual is split into several pages. For instance, to 
find the manual for Perl's built-in functions, type man perlfunc or perldoc perlfunc. 
HTML versions of the manual exist, and they can be installed on your local 
computer. This is my preferred method of accessing the documentation: it gives 
you links that make navigating easier, and if it's installed locally, you can use it 
even when you're not connected to the Internet. 


A.1.5 Books 


There are lots of Perl books. Many of them are excellent; some are not. Here's a 
short list of the Perl books I've found most useful in my own work. 


Programming Perl, Third Edition; by Larry Wall, Tom Christiansen, and Jon Orwant; 
O'Reilly & Associates. This is the standard book on Perl by the creator of the language. It 
explains pretty much everything, although it can lag behind the latest version of Perl. So 
the absolute authority for your installation should be the online manuals. Programming 
Perl covers a lot of ground; it's good as a reference, a tutorial, and as a ripping yarn if 
you're into that sort of thing. It presents some of the philosophy behind the language, so 
it's a good way to absorb some of the computer-science mindset. Earlier editions, if you 
happen to have them, will also serve; I'm particularly fond of the first edition. 

Perl Cookbook, by Tom Christiansen and Nathan Torkington, O'Reilly & Associates. 
This is billed as the companion volume to Programming Perl, and so it is. Here, you will 
find examples that use Perl for different tasks. It's a great help in many situations, and if 
you will be doing much Perl programming, it's worth taking at least a few hours to peruse 
it. 

Mastering Algorithms with Perl; by Jon Orwant, Jarkko Hietaniemi, and John Macdonald; 
O'Reilly & Associates. I've mentioned the importance of studying algorithms and this 
fine book presents many important algorithms in the context of Perl. It explains concepts 
and gives code; it doesn't, however, teach the mathematics of analyzing and measuring 
algorithms. Really serious algorithms students will find that information in texts such as 
Introduction to Algorithms by Cormen, Leiserson, and Rivest. Even if you're a novice 
programmer, this is still a valuable book, and you'll find lots of code you'll be able to use. 

Mastering Regular Expressions, by Jeffrey E. F. Friedl, O'Reilly & Associates. A good 
book on an important topic with excellent coverage of Perl. 

Elements of Programming in Perl, by Andrew L. Johnson, Manning Publications. This is 
another book intended for beginners. It's very good, and I recommend it as a supplement 
to this text. 

Learning Perl, Third Edition; by Randal L. Schwartz and Tom Phoenix; O'Reilly & 
Associates. This is the classic tutorial book on Perl. It's well-written and well-organized. 
If you've gotten through Beginning Perl for Bioinformatics, you should have no trouble 
with Learning Perl. 

Object-Oriented Perl, by Damian Conway, Manning Publications. A superb book on the 
topic suitable for the beginning or advanced programmer. 


A.1.6 Conference 


O'Reilly Open Source Convention. This convention now includes the yearly Perl 
Conference. It's a chance to attend classes and lectures and meet Perl practitioners of all 
sorts. There are also regular YAPC (yet another Perl conference) meetings; you'll find the 
details at the main Perl web site. 


A.1.7 Newsgroups 


Perl newsgroups are an important resource for programmers. If you've never seen 
them, they're accessible over the Web (among other ways). They give you the ability 
to write a message to a large group of people with interests in any of hundreds of 
specific topics. If you have a question that you haven't been able to answer in the 
Perl documentation or the FAQs, searching the newsgroups for the topic of your 
question can often result in an answer. You can also post a question to a newsgroup 
if you can't find an answer already provided, but this is not often necessary. 


I want to emphasize how useful this resource is. The drawback is that there tends to 
be a "low signal-to-noise ratio": in other words, there's often a lot of uninformative 
material in newsgroups. But it can be worth wading through; even negative 
responses (no known solution to the problem) can save you time and effort. 


There are a number of newsgroups related to Perl in the comp.lang.perl hierarchy. 
The search engine deja.com (recently sold to google.com but still available) lets you 
search the archives of these newsgroups. More information is available in the Perl 
FAQs about specific newsgroups; for instance, many specific Perl modules have their 
own newsgroups, mailing lists, or web sites. The CPAN web site is another place to 
find searchable newsgroup archives. 


A.2 Computer Science 


Even though you're programming for biological applications, you'll often find yourself 
venturing into the realm of traditional computer science. Here are some published 
resources to help you find your way. 


A.2.1 Algorithms 


Mastering Algorithms with Perl; by Jon Orwant, Jarkko Hietaniemi, and John Macdonald; 
O'Reilly & Associates. The best book for noncomputer scientists who program in Perl. 

Introduction to Algorithms; by Thomas H. Cormen, Charles E. Leiserson, and Ronald L. 
Rivest; MIT Press and McGraw-Hill. This is a really good book on algorithms—in many 
ways, the best. It's one of the standard university texts (arguably the standard text) at both 
the graduate and undergraduate levels. It works well as a textbook and as a reference. Its 
target audience is computer-science students, so there is a fair amount of math included, 
but even nonmathematical programmers will find this book very helpful. 

Fundamentals of Algorithmics, by Gilles Brassard and Paul Bratley, Prentice Hall. An 
easy overview of algorithmic techniques. 

Algorithms on Strings, Trees, and Sequences: Computer Science and Computational 
Biology; by Dan Gusfield; Cambridge University Press. This book specializes in 
algorithms for strings, including such topics as sequence alignment. It's very detailed, but 
even so, not complete: this is a big field! The best single source on string algorithms, with 
lots of information about biological sequence similarity. 


The following books are for advanced study. 


The Design and Analysis of Computer Algorithms; by Alfred V. Aho, John E. Hopcroft, 
and Jeffrey D. Ullman; Addison-Wesley. This is the classic book on the science of 
algorithms. 

Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes; by 
Frank Thomson Leighton; Morgan Kaufmann. A comprehensive and rigorous text and 
reference. 

Randomized Algorithms, by Rajeev Motwani and Prabhakar Raghavan, Cambridge 
University Press. A clear, rigorous book. 


A.2.2 Software Engineering 

Software Engineering, Second Edition; by Ian Sommerville; Addison-Wesley. A good, 
general book that covers lots of important topics and generally avoids taking sides for or 
against competing styles. 


A.2.3 Theory of Computer Science 


Introduction to Automata Theory, Languages, and Computation, Second Edition; by 
John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman; Addison-Wesley. The classic 
text on computer-science theory. 

Computers and Intractability: A Guide to the Theory of NP-Completeness, by Michael 
R. Garey and David S. Johnson, W.H. Freeman & Co. The classic, and superb, book on 
the topic. 


A.2.4 General Programming 


The Unix Programmers Manual, Steven V. Earhart, ed., Harcourt, Brace and Jovanovich 
School. This manual for Unix (whatever version of Unix) is a crash course in computer 
science with an emphasis on programming. The design of the interacting programs, and 
the concepts of pipes, redirection, processes, and so on, has been one of the great success 
stories of programming. This manual summarizes the system: Part I documents user 
programs; Parts II and III document the programming interface. The programmable shell, 
and the programs grep, awk, and sed were some of the primary inspirations for Perl. 

The C Programming Language, by Brian W. Kernighan and Dennis M. Ritchie, Prentice 
Hall PTR. C and C++ are important languages in bioinformatics, and this classic book 
teaches C. If you work through the book, attempting all the programming exercises, 
you'll have some excellent programming training. 

Structure and Interpretation of Computer Programs; by Harold Abelson, Gerald Jay 
Sussman, and Julie Sussman; MIT Press. A really interesting book that looks deeply at 
programming in the context of learning a dialect of Lisp. 

The Unix Programming Environment, by Brian W. Kernighan and Robert Pike, Prentice 
Hall. This book is fun, and it talks about good software design. 


A.3 Linux 


If you have a Linux system, you have all the source code for the entire system available 
(this is also true for some Unix systems). (If it's not installed, you can get it from the 
distribution CDs, from the web site http://www.linux.org, or from the web site of the 
company that produced your version of Linux.) This is a great resource. You can take a 
look at how any program is actually written, even the operating system. Now you're 
really getting into programming! 





A.4 Bioinformatics 


Bioinformatics is a relatively new discipline that's attracting a lot of attention, so the 
available resources are multiplying fairly quickly. Here are a few books and other 
resources to help get you started. 


A.4.1 Books 


Developing Bioinformatics Computer Skills, by Cynthia Gibas and Per Jambeck, 
O'Reilly & Associates. This is a really good book for beginners. It covers setting up a 
Linux workstation and the installation and use of many of the best, and least expensive, 
bioinformatics programs. It teaches how to use bioinformatics programs, not how to 
program. It's the most practical bioinformatics book available. 

Introduction to Computational Biology: Maps, Sequences and Genomes; by Michael S. 
Waterman; CRC Press. This is a classic book with a predominantly statistical outlook. 

Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Second 
Edition; edited by Andreas D. Baxevanis and B.F. Francis Ouellette; John Wiley & Sons. 
Includes chapters on a wide range of topics by several authors. 


A.4.2 Governmental Organizations 


Absolutely essential. The following web sites are for the most important government- 
sponsored bioinformatics organizations: 


http://www.ncbi.nlm.nih.gov/: 
the National Center for Biotechnology Information (NCBI). The U.S. 
government center. 


http://www.embl.org/: 
the European Molecular Biology Laboratory (EMBL). The European Union 
laboratory. 


http://www.ebi.ac.uk/: 
the European Bioinformatics Institute (EBI) of EMBL. 


A.4.3 Conferences 


Bioinformatics has long been a part of various biology conferences, for instance the 
Cold Spring Harbor conferences on sequencing. Now there are many conferences 
with such coverage, often under the heading of "genomics." Here are a few 
interesting conferences: 


• ISMB: Intelligent Systems for Molecular Biology, now in its ninth year 
• Bioinformatics Open Source Conference, http://www.bioinformatics.org/ 
• RECOMB: Conference on Computational Molecular Biology 


A.5 Molecular Biology 


Recombinant DNA, by James Watson, et al., W.H. Freeman & Co. This book, though 
getting old for such a fast-moving field, is a gem for programmers and computer 
scientists entering the bioinformatics field. Many standard techniques are clearly and 
briefly explained with excellent illustrations. Look for the second edition, if you can find 
it. 

Molecular Biology of the Gene, Fourth Edition, by James Watson, et al., Addison- 
Wesley. The classic book in molecular biology. It's very detailed; at this level of 
coverage, it's definitely out of date, but it's—well—a classic. Makes a good reference for 
the basics. 

Molecular Cell Biology, Fourth Edition, by Harvey Lodish, et al., W.H. Freeman & Co. 
An excellent and extensive introductory review of cell biology. 


Appendix B. Perl Summary 


This appendix summarizes those parts of the Perl programming language that will be 
most useful to you as you read this book. It is not a comprehensive summary of the Perl 
language. Remember that Perl is designed so that you don't need to know everything in 
order to use it. Source material for this appendix came from Programming Perl, Third 
Edition (O'Reilly & Associates). 


B.1 Command Interpretation 


The Perl programs in this book start with the line: 


#!/usr/bin/perl -w 


On Unix (or Linux) systems, the first line of a file can include the name of a program and some flags, 
which are optional. The line must start with #!, followed by the full pathname of the program (in our case, 
the Perl interpreter), followed optionally by a single group of one or more flags. 


If the Perl program file is called myprogram and has executable permissions, you can type myprogram 
(or possibly ./myprogram, or the full or relative pathname for the program) to start the program running. 


The Unix operating system starts the program specified in the command 
interpretation line and gives it as input the rest of the file after the first line. So, in 
this case, it starts the Perl interpreter and gives it the program in the file to run. 

This is just a shortcut for typing: 


/usr/bin/perl -w myprogram 


at the command line. 


B.2 Comments 


A comment begins with a # sign and continues from there to the end of the same 
line. It is ignored by the Perl interpreter and is only there for programmers to read. A 
comment can include any text. 


B.3 Scalar Values and Scalar Variables 


A scalar value is a single item of data, like a string or a number. 


B.3.1 Strings 


Strings are scalar values and are written as text enclosed within single quotes, like 
so: 


'This is a string in single quotes.' 


or double quotes, such as: 


"This is a string in double quotes." 


A single-quoted string prints out exactly as written. With double quotes, you can include 
a variable in the string, and its value will be inserted or "interpolated." You can also 
include commands such as \n to represent a newline (see Table B-3): 


$aside = '(or so they say)'; 
$declaration = "Misery\n $aside \nloves company."; 
print $declaration; 


This snippet prints out: 


Misery 
 (or so they say) 
loves company. 


B.3.2 Numbers 


Numbers are scalar values that can be: 


Integers: 


3 
-4 
0 


Floating-point (decimal): 


4.5326 


Scientific (exponential) notation (3.13 x 10^23 or 313000000000000000000000): 


3.13E23 


Hexadecimal (base 16): 


0x12b6e3 


Octal (base 8): 


05777 


Binary (base 2): 


0b10101011 


Complex (or imaginary) numbers, such as 3 + i, and fractions (or ratios, or rational 
numbers), such as 1/3, can be a little tricky. Perl can handle fractions but converts them 
internally to floating-point numbers, which can make certain operations go wrong (Perl is 
not alone among computer languages in this regard): 


if ( 10/3 == ( (1/3) * 10 ) ) { 
    print "Success!"; 
} else { 
    print "Failure!"; 
} 


This prints: 


Failure! 


To properly handle rational arithmetic with fractions, complex numbers, or many 
other mathematical constructs, there are mathematics modules available, which 
aren't covered here. 
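

When exact rational arithmetic isn't needed, a common workaround is to compare floating-point numbers within a small tolerance instead of testing for exact equality. The sketch below shows one way to do that. The subroutine name nearly_equal and the default tolerance of 1e-9 are arbitrary choices for illustration, and the right tolerance depends on your data.

```perl
use strict;
use warnings;

# Compare two floating-point numbers within a relative tolerance.
# The default tolerance is an arbitrary value chosen for illustration.
sub nearly_equal {
    my ($x, $y, $tolerance) = @_;
    $tolerance = 1e-9 unless defined $tolerance;

    # Scale the tolerance by the larger magnitude, but never below 1,
    # so that values close to zero are still compared sensibly.
    my $scale = abs($x) > abs($y) ? abs($x) : abs($y);
    $scale = 1 if $scale < 1;

    return abs($x - $y) <= $tolerance * $scale;
}

if ( nearly_equal( 10/3, (1/3) * 10 ) ) {
    print "Success!\n";
} else {
    print "Failure!\n";
}
```

With this comparison the two expressions are treated as equal, because they differ only by rounding error in the last binary digit.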


B.3.3 Scalar Variables 


Scalar values can be stored in scalar variables. A scalar variable is indicated with a $ 
before the variable's name. The name begins with a letter or underscore and can have any 
number of letters, underscores, or digits. A digit, however, can't be the first character in a 
variable name. Here are some examples of legal names of scalar variables: 


$Var 
$var_1 


Here are some improper names for scalar variables: 


$1var 
$var!iable 


Names are case sensitive: $dna is different from $DNA. 


These rules for making proper variable names (apart from the beginning $) also hold for the names of array 
and hash variables and for subroutine names. 


A scalar variable may hold any type of scalar value mentioned previously, such as 
strings or the different types of numbers. 


B.4 Assignment 


Scalar variables are assigned scalar values with an assignment statement. For 
instance: 


$thousand = 1000; 


assigns the integer 1,000, a scalar value, to the scalar variable $thousand. 


The assignment statement looks like an equal sign from elementary mathematics, but its meaning is
different. The assignment statement is an instruction, not an assertion. It doesn't mean "$thousand equals
1,000." It means "store the scalar value 1,000 into the scalar variable $thousand." However, after the
statement, the value of the scalar variable $thousand is, indeed, equal to 1,000.


You can assign values to several scalar variables by surrounding variables and values 
in parentheses and separating them by commas, thus making lists: 


($one, $two, $three) = (1, 2, 3);


There are several assignment operators besides = that are shorthand for longer 
expressions. For instance, $a += $b is equivalent to $a = $a + $b. Table B-1 is a 
complete list (it includes several operators that aren't covered in this book). 

Table B-1. Assignment operator shorthands


Example of operator    Equivalent
$a += $b               $a = $a + $b    (addition)
$a -= $b               $a = $a - $b    (subtraction)
$a *= $b               $a = $a * $b    (multiplication)
$a /= $b               $a = $a / $b    (division)
$a **= $b              $a = $a ** $b   (exponentiation)
$a %= $b               $a = $a % $b    (remainder of $a / $b)
$a x= $b               $a = $a x $b    (string $a repeated $b times)
$a &= $b               $a = $a & $b    (bitwise AND)
$a |= $b               $a = $a | $b    (bitwise OR)
$a ^= $b               $a = $a ^ $b    (bitwise XOR)
$a >>= $b              $a = $a >> $b   ($a shifted $b bits to the right)
$a <<= $b              $a = $a << $b   ($a shifted $b bits to the left)
$a &&= $b              $a = $a && $b   (logical AND)
$a ||= $b              $a = $a || $b   (logical OR)
$a .= $b               $a = $a . $b    (append string $b to $a)
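

A few of these shorthands in action may make the table easier to read. This is only a small sketch; the variable names and values are invented for the example.

```perl
use strict;
use warnings;

my $count = 10;
$count += 5;         # same as $count = $count + 5;    now 15
$count **= 2;        # same as $count = $count ** 2;   now 225
$count %= 7;         # same as $count = $count % 7;    now 1

my $sequence = 'AC';
$sequence x= 3;      # repeat the string three times: 'ACACAC'
$sequence .= 'GT';   # append a string: 'ACACACGT'

my $label;
$label ||= 'unnamed';    # assigns only because $label is false (undefined)

print "$count $sequence $label\n";
```

The ||= form is often used this way to supply a default value for a variable that may not have been set.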


B.5 Statements and Blocks 


Programs are composed of statements, often grouped together into blocks.


A statement ends with a semicolon (;), which is optional for the last statement in a 
block. 


A block is one or more statements usually surrounded by curly braces; here's an 
example: 


{
    $thousand = 1000;
    print $thousand;
}


Blocks may stand by themselves but are often associated with such constructs as loops or
if statements.


B.6 Arrays 


Arrays are ordered collections of zero or more scalar values, indexed by position. An
array variable begins with the at sign @ followed by a legal variable name. For instance,
here are two possible array variable names:


@array1
@dna_fragments


You can assign scalar values to an array by placing the scalar values in a list, 
separated by commas and surrounded by a pair of parentheses. For instance, you 
can assign an array the empty list: 


@array = ( ); 


or one or more scalar values: 


@dna_fragments = ('ACGT', $fragment2, 'GGCGGA');


Notice that it's okay to specify a scalar variable such as $fragment2 in a list. Its current value, not the
variable name, is placed into the array.


The individual scalar values of an array (the elements) are indexed by their position in the array. The index
numbers begin at 0. You can specify the individual elements of an array by preceding the array name by a
$ and following it with the index number of the element within square brackets [ ], like so:


$dna_fragments[2]


This equals the value of 'GGCGGA', given the values previously set for this array. Notice that the array has 
three scalar values, indexed by numbers 0, 1, and 2. The third and last element is indexed 2, one less than 
the total number of elements 3, because the first element is indexed number 0. 


You can make a copy of an array using the assignment operator =, as in this example that makes a copy
@output of an existing array @input:


@output = @input;


If you evaluate an array in scalar context, the value is the number of elements in the array.
So if array @input has five elements, the following example assigns the value 5 to
$count:


$count = @input;


Figure B-1 shows an array @myarray with three elements. It demonstrates the
ordered nature of an array: each element appears at, and can be found by, its
position in the array.


Figure B-1. Schematic of an array 


Array:      @myarray = ('DNA', 'RNA', 'Protein');
Positions:  0      1      2
Values:     DNA    RNA    Protein
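

The array operations described in this section can be put together in a short sketch. The value given to $fragment2 is made up for the example, and @copy is a name I chose to show that a copy is independent of the original.

```perl
use strict;
use warnings;

my $fragment2     = 'TTAGC';    # hypothetical value for this sketch
my @dna_fragments = ('ACGT', $fragment2, 'GGCGGA');

my $third = $dna_fragments[2];  # indexes start at 0, so this is 'GGCGGA'

my @copy = @dna_fragments;      # copies the elements
$copy[0] = 'CHANGED';           # the original array is not affected

my $count = @dna_fragments;     # scalar context gives the number of elements

print "$third $count $dna_fragments[0] $copy[0]\n";
```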


B.7 Hashes 


A hash (also called an associative array) is a collection of zero or more pairs of scalar
values, called keys and values. The values are indexed by the keys. A hash variable
begins with the percent sign % followed by a legal variable name. For instance, possible
hash variable names are:


%hash1
%genes_by_name


You can assign a value to a key with a simple assignment statement. For example, say
you have a hash called %baseball_stadiums and a key Phillies to which you want to
assign the value Veterans Stadium. This statement accomplishes the assignment:


$baseball_stadiums{'Phillies'} = 'Veterans Stadium';


Note that a single hash value is referenced by a $ instead of a % at the beginning of the 
hash name; this is similar to the way you reference individual array values by using a $ 
instead of a @. 


You can assign several keys and values to a hash by placing their scalar values in a 
list, separated by commas and surrounded by a pair of parentheses. Each successive 
pair of scalars becomes a key and a value in the hash. For instance, you can assign a 
hash the empty list: 


%hash = ( );


You can also assign one or more scalar key-value pairs:


%genes_by_name = ('gene1', 'AACCCGGTTGGTT', 'gene2', 'CCTTTCGGAAGGTC');


There is another way to do the same thing, which makes the key-value pairs
more readily apparent. This accomplishes the same thing as the preceding example:


%genes_by_name = (
    'gene1' => 'AACCCGGTTGGTT',
    'gene2' => 'CCTTTCGGAAGGTC'
);


To get the value associated with a particular key, precede the hash name with a $ and
follow it with a pair of curly braces { } containing the scalar value of the key:


$genes_by_name{'gene1'}


This returns the value 'AACCCGGTTGGTT', given the value previously assigned to the key
'gene1' in the hash %genes_by_name. Figure B-2 shows a hash with three keys.


Figure B-2. Schematic of a hash 


[Figure: a hash assigned from a list of three key/value pairs. The keys 'frame1', 'frame2',
and 'frame3' index the values 'ACGTACGT', 'CGTACGT', and 'GTACGT'; the pairs are unordered.]
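

Here is a small sketch that pulls the hash operations of this section together. The 'gene3' entry is invented for the example, and the exists, keys, and sort functions are standard Perl built-ins that this section doesn't otherwise discuss.

```perl
use strict;
use warnings;

my %genes_by_name = (
    'gene1' => 'AACCCGGTTGGTT',
    'gene2' => 'CCTTTCGGAAGGTC',
);

# Add a single pair; note the $ when referring to one value.
$genes_by_name{'gene3'} = 'GGATC';    # hypothetical extra gene

# Look up a value by key, checking first that the key is present.
my $sequence = exists $genes_by_name{'gene1'} ? $genes_by_name{'gene1'} : '';

# A hash is unordered, so sort the keys to get a repeatable listing.
my @names = sort keys %genes_by_name;

print "$_ => $genes_by_name{$_}\n" for @names;
```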


B.8 Operators 


Operators are functions that represent basic operations on values: addition, subtraction, 
etc. They are frequently used and are core parts of the Perl programming language. They 
are really just functions that take arguments. For instance, + is the operator that adds two 
numbers, like so: 

3 + 4; 


Operators typically have one, two, or three operands; in the example just given, there are two operands 3 
and 4. 


Operators can appear before, between, or after their operands. For example, the plus operator + appears 
between its operands. 


B.9 Operator Precedence 


Operator precedence determines the order in which the operations are applied. For
instance, in Perl, the expression:


3 + 3 * 4


isn't evaluated left to right (which would calculate 3 + 3 equals 6, and 6 times 4 for a
value of 24); the precedence rules cause the multiplication to be applied first, for a final
result of 15. The precedence rules are available in the perlop manual page and in most
Perl books. However, I recommend you use parentheses to make your code more
readable and to avoid bugs. They make these expressions unambiguous; the first:


(3 + 3) * 4


evaluates to 24, and the second:


3 + (3 * 4)


evaluates to 15.
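

The same expressions can be checked directly in a short sketch. The last line is an extra case of my own, included because it surprises many people: exponentiation binds more tightly than the unary minus sign.

```perl
use strict;
use warnings;

my $implicit = 3 + 3 * 4;      # multiplication is applied first: 15
my $grouped  = (3 + 3) * 4;    # parentheses force the addition first: 24
my $explicit = 3 + (3 * 4);    # same as the implicit form: 15

my $negated  = -2 ** 2;        # read as -(2 ** 2), so the result is -4

print "$implicit $grouped $explicit $negated\n";
```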


B.10 Basic Operators 




For more information on how operators work, consult the perlop documentation bundled
with Perl.


B.10.1 Arithmetic Operators 


Perl has the basic five arithmetic operators:


+     Addition
-     Subtraction
*     Multiplication
/     Division
**    Exponentiation


These operators work on both integers and floating-point values (and if you're not
careful, strings as well).


Perl also has a modulus operator, which computes the remainder of two integers:


%     Modulus


For example, 17 % 3 is 2, because 2 is left over when you divide 3 into 17.


Perl also has autoincrement and autodecrement operators: 


++ add one 
-- subtract one 


Unlike the previous six operators, these change a variable's value. $x++ adds one to $x, 
changing 4 to 5 (or a to b). 
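

A brief sketch of these operators follows; the variable names are mine. Note that the autoincrement operator also works on strings, which is what "a to b" refers to.

```perl
use strict;
use warnings;

my $remainder = 17 % 3;     # 2 is left over
my $power     = 2 ** 10;    # 1024
my $quotient  = 7 / 2;      # 3.5; division is not truncated to an integer

my $x = 4;
$x++;                       # $x is now 5

my $letter = 'a';
$letter++;                  # string increment: 'a' becomes 'b'

my $label = 'az';
$label++;                   # the increment carries over: 'az' becomes 'ba'

print "$remainder $power $quotient $x $letter $label\n";
```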


B.10.2 Bitwise Operators 


All scalars, whether numbers or strings, are represented as sequences of individual 
bits "under the hood." Every once in a while, you need to manipulate those bits, and 
Perl provides five operators to help: 


&     Bitwise and
|     Bitwise or
^     Bitwise xor
>>    Right shift
<<    Left shift
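

To make the five operators concrete, here is a sketch that applies them to two small binary numbers. The values are arbitrary, and printf with the %b format is used only to display the results in binary.

```perl
use strict;
use warnings;

my $x = 0b1100;         # 12
my $y = 0b1010;         # 10

my $and   = $x & $y;    # bits set in both:        0b1000 = 8
my $or    = $x | $y;    # bits set in either:      0b1110 = 14
my $xor   = $x ^ $y;    # bits set in exactly one: 0b0110 = 6
my $right = $x >> 2;    # shift right two places:  0b0011 = 3
my $left  = $x << 2;    # shift left two places:   0b110000 = 48

printf "%04b %04b %04b %04b %06b\n", $and, $or, $xor, $right, $left;
```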


B.10.3 String Operators 


Two strings may be concatenated (joined together end to end) with the dot operator:


'This is a ' . 'joined string'


This results in the value 'This is a joined string'.


A string may be repeated with the x operator:

print "Hear ye! " x 3; 

This prints out: 


Hear ye! Hear ye! Hear ye! 





B.10.4 File Test Operators 


File test operators are unary operators that test files for certain characteristics, such as -e 
$file, which returns true if the file $file exists. Table B-2 lists some available file test 
operators. 

Table B-2. File test operators


Operator    Meaning
-r          File is readable
-w          File is writable
-x          File is executable
-o          File is owned by "you"
-e          File exists
-z          File has zero size in bytes
-s          File has nonzero size (returns size in bytes)
-f          File is a plain file
-d          File is a directory (a.k.a. folder)
-l          File is a symbolic link
-t          Filehandle is opened to a terminal
-T          File is a text file
-B          File is a binary file
-M          Age of file (at startup of program) in days since modification
-A          Age of file (at startup of program) in days since last access
-C          Age of file (at startup of program) in days since last inode change
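

The following sketch tries a few of these tests on a scratch file. It assumes the File::Temp module that is bundled with Perl, which is used here only so the example has a real file to examine; the variable names are my own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a temporary file that is removed when the program exits.
my ( $fh, $file ) = tempfile( UNLINK => 1 );
print $fh "ACGT\n";
close $fh;

my $exists   = ( -e $file ) ? 1 : 0;    # the file exists
my $is_plain = ( -f $file ) ? 1 : 0;    # it is a plain file
my $is_dir   = ( -d $file ) ? 1 : 0;    # it is not a directory
my $size     = -s $file;                # nonzero size in bytes

print "exists: $exists, plain: $is_plain, directory: $is_dir, size: $size\n";
```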





B.11 Conditionals and Logical Operators 
This section covers conditional statements and logical operators. 


B.11.1 true and false 


In a conditional test, an expression evaluates to true or false, and based on the result, a statement or 
block may or may not be executed. 


A scalar value can be true or false in a conditional. A string is false if it's the empty string
(represented as "" or '') or the string '0'. Any other string is true.


Similarly, an array or a hash is false if empty, and true if nonempty.
A number is false if it's 0; a number is true if it's not 0.


Most things you evaluate in Perl return some value (such as a number from an 
arithmetic expression or an array returned from a subroutine), so you can use most 
things in Perl in conditional tests. Sometimes you may get an undefined value, for 
instance if you try to add a number to a variable that has not been assigned a value. 
Then things might fail to work as expected. For instance: 


use strict;
use warnings;
my $a;
my $b;
$b = $a + 2;


produces the warning output:


Use of uninitialized value in addition (+) at - line 5.


You can test for defined and undefined values with the Perl function defined.
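

As a minimal sketch of how defined can be used to avoid that warning (the variable names are invented for the example):

```perl
use strict;
use warnings;

my $total;                                    # declared but never assigned
my $was_defined = defined($total) ? 1 : 0;    # 0: no value yet

# Give the variable a starting value only if it has none.
$total = 0 unless defined $total;
$total = $total + 2;                          # no warning this time

# Being defined is not the same as being true: 0 is defined but false.
my $zero         = 0;
my $zero_defined = defined($zero) ? 1 : 0;    # 1
my $zero_true    = $zero ? 1 : 0;             # 0

print "$was_defined $total $zero_defined $zero_true\n";
```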





B.11.2 Logical Operators 


There are four logical operators:


not
and
or
xor


not turns true values into false and false values into true. Its use is best illustrated in
code:


if ( not $done ) {...}


This executes the code only if $done is false.


and is a binary operator that returns true if both its operands are true. If one or both of
the operands are false, the operator returns false:


1 and 1       returns true
'a' and ''    returns false
'' and 0      returns false


or is a binary operator that returns true if one or both of the operands are true. If both
operands are false, it returns false:


1 or 1        returns true
'a' or ''     returns true
'' or 0       returns false


xor, or exclusive-OR, returns true if one operand is true and the other operand is false;
xor returns false if both operands are true or if both operands are false:


1 xor 0       returns true
0 xor 1       returns true
1 xor 1       returns false
0 xor 0       returns false


There are also variants on most of these:


! for not
&& for and
|| for or


These have different precedence but otherwise behave the same. Some older
versions of Perl may only have !, &&, and || instead of not, and, and or.
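

The difference in precedence is easiest to see with ! and not. The sketch below is only an illustration with values I picked; it also shows that || returns the operand that decided the result, which is why it is often used to supply a default.

```perl
use strict;
use warnings;

# ! binds very tightly, so it applies to 0 alone: (!0) + 1 is 2.
my $bang = !0 + 1;

# not binds very loosely, so it applies to the whole sum: not (0 + 1).
my $word = ( not 0 + 1 ) ? 'true' : 'false';    # 'false'

# || returns the operand that settled the outcome, not just true or false.
my $name = '' || 'unknown';                     # 'unknown'

my $one_true  = ( 1 xor 0 ) ? 'true' : 'false'; # 'true'
my $both_true = ( 1 xor 1 ) ? 'true' : 'false'; # 'false'

print "$bang $word $name $one_true $both_true\n";
```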


B.11.3 Using Logical Operators for Control Flow 


A quick and popular way to take an action depending on the results of a previous 
action is to chain the statements together with logical operators. For instance, it's 
common in Perl programs to see the following statement to open a file: 





open(FH, $filename) or die "Cannot open file $filename: $!"; 


The use of or in this statement shows another important thing about the binary logical operators: they 
evaluate their arguments left to right. In this case, if the open succeeds, the or operator never bothers to 
check the value of the second operand (die, which exits the program with the message in the string, plus 
additional messages if $! is included). The or never bothers, because if one operand is true, the or is 
true, so it doesn't need to check the second operand. However, if the open fails, the or needs to check 
that the second operand is true or false, so it goes ahead and executes the die statement. 


You can use the and operator similarly to evaluate the second operand only if the first operand succeeds.
xor doesn't work for control flow, since both its arguments have to be evaluated each time.


I haven't used this chaining of logical operators much; I've used if statements instead. This is because I
often find that I want to add more statements following a test, and it's easier if the original is written as an
if statement with a block, and harder if the original is written as a logical operator.


B.11.4 The if Statement 


Conditional tests are commonly found in if statements and in their variants and loops. 


Here's an example of an if statement:


if ( open(FH, $filename) ) {
    print "Hurray, I opened the file.";
}


The if statement is followed by a conditional expression enclosed in parentheses, which is followed by a 
block enclosed in curly braces { }. When the conditional expression evaluates as true, the statements in 
the block are executed. 


The if statement may optionally be followed by an else, which is executed when the conditional 
evaluates to false: 


if ( open(FH, $filename) ) {
    print "Hurray, I opened the file.";
} else {
    print "Rats. The file did not open.";
}


The if statement may also optionally include any number of elsif clauses, which check
additional conditional statements if none of the preceding conditional statements are true:


if ( open(FH, $file1) ) {
    print "Hurray, I opened file 1.";
} elsif ( open(FH, $file2) ) {
    print "Hurray, I opened file 2.";
} elsif ( open(FH, $file3) ) {
    print "Hurray, I opened file 3.";
} else {
    print "None of the dadblasted files would open.";
}


In the preceding example, if file 1 opened successfully, the if statement doesn't try to open additional 
files. 


There is also an unless statement, which is the same as an if statement with the conditional negated. So 
these two statements are equivalent: 


unless ( open(FH, $filename) ) {
    print "Rats. The file did not open.";
}


if ( not open(FH, $filename) ) {
    print "Rats. The file did not open.";
}
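

The if, elsif, else, and unless forms can also be tried without touching any files. The sketch below uses a hypothetical subroutine, classify_base, that I wrote for this illustration; it is not part of Perl or of the programs in this book.

```perl
use strict;
use warnings;

# Hypothetical helper: name the chemical class of a single DNA base.
sub classify_base {
    my ($base) = @_;
    if ( $base eq 'A' or $base eq 'G' ) {
        return 'purine';
    } elsif ( $base eq 'C' or $base eq 'T' ) {
        return 'pyrimidine';
    } else {
        return 'unknown';
    }
}

my @classes = map { classify_base($_) } ( 'A', 'C', 'N' );

# unless runs its block only when the condition is false.
unless ( $classes[2] eq 'purine' ) {
    print "N is not a purine\n";
}

print "@classes\n";
```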


B.12 Binding Operators 


Binding operators are used for pattern matching, substitution, and transliteration on
strings. They are used in conjunction with regular expressions that specify the
patterns. Here's an example:


'ACGTACGTACGTACGT' =~ /CTA/


The pattern is the string CTA, enclosed by forward slashes //. The string binding operator is =~; it tells the
program which string to search, returning true if the pattern appears in the string.


Another string binding operator is !~, which returns true if the pattern isn't in the string:


'ACGTACGTACGTACGT' !~ /CTA/


This is equivalent to:


not 'ACGTACGTACGTACGT' =~ /CTA/


You can substitute one pattern for another using the string binding operator. In the next
example, s/thine/nine/ is the substitution command, which substitutes the first
occurrence of thine with the string nine:


$poor_richard = 'A stitch in time saves thine.';
$poor_richard =~ s/thine/nine/;
print $poor_richard;


This produces the output: 


A stitch in time saves nine. 

Finally, the transliteration (or translate) operator tr substitutes characters in a string. It
has several uses, but the two uses I've covered are first, to change bases to their
complements A -> T, C -> G, G -> C, and T -> A:


$DNA = 'ACGTTTAA';
$DNA =~ tr/ACGT/TGCA/;


This produces the value:


TGCAAATT


Second, the tr operator counts the number of a particular character in a string, as in this
example which counts the number of As in a string of DNA sequence data:


$DNA = 'ACGTTTAA';
$count = ($DNA =~ tr/A//);
print $count;


This produces the value 3. This shows that tr can return a count of the
number of translations made in a string, which is then assigned to the variable $count.
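

The two uses of tr combine naturally. As a short sketch with variable names of my own choosing, here is a reverse complement and a count of G and C bases:

```perl
use strict;
use warnings;

my $dna = 'ACGTTTAA';

# Copy the string first, then complement the copy, leaving $dna intact.
( my $complement = $dna ) =~ tr/ACGT/TGCA/;    # 'TGCAAATT'

# reverse in scalar context reverses the characters of a string.
my $revcom = reverse $complement;              # 'TTAAACGT'

# With an empty replacement list, tr only counts and changes nothing.
my $gc_count = ( $dna =~ tr/GC// );            # 2

print "$complement $revcom $gc_count\n";
```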


B.13 Loops 


Loops repeatedly execute the statements in a block until a conditional test changes 
value. There are several forms of loops in Perl: 


while (CONDITION) {BLOCK}
until (CONDITION) {BLOCK}
for (INITIALIZATION ; CONDITION ; RE-INITIALIZATION) {BLOCK}
foreach VAR (LIST) {BLOCK}
for VAR (LIST) {BLOCK}
do {BLOCK} while (CONDITION)
do {BLOCK} until (CONDITION)


The while loop first tests if the conditional is true; if so, it executes the block and then
returns to the conditional to repeat the process; if false, it does nothing, and the loop is
over. For example:


$i = 3;
while ( $i ) {
    print "$i\n";
    $i--;
}


This produces the output:


3
2
1


Here's how the loop works. The scalar variable $i is first initialized to 3 (this isn't part of 
the loop). The loop is then entered, and $i is tested to see if it has a true (nonzero) value. 
It does, so the number 3 is printed, and the decrement operator is applied to $i, which 
reduces its value to 2. The block is now over, and the loop starts again with the 
conditional test. It succeeds with the true value 2, which is printed and decremented. 
The loop restarts with a test of $i, which is now the true value 1; 1 is printed and 
decremented to 0. The loop starts again; 0 is tested to see if it's true, and it's not, so the 
loop is now finished. 


Loops often follow the same pattern, in which a variable is set, and a loop is called,
which tests the variable's value and then executes a block, which includes changing
the value of the variable.


The for loop makes this easy by including the variable initialization and the variable
change in the loop statement. The following is exactly equivalent to the preceding
example and produces the same output:


for ( $i = 3 ; $i ; $i-- ) {
    print "$i\n";
}


The foreach loop is a convenient way to iterate through the elements in an array. Here's
an example:


@array = ('one', 'two', 'three');


foreach $element (@array) {
    print $element,"\n";
}


This prints the output: 


one 
two 
three 


The foreach loop specifies a scalar variable $element to be set to each element of the array. (You may 
use any variable name or none, in which case the special variable $_ is used automatically.) The array to 
be iterated over is then placed in parentheses, followed by the block. You can use for instead of 
foreach as the name of this loop, with identical behavior. 


The first time through the loop, the value of the first element of the array is assigned to the foreach 
variable $element. On each succeeding pass through the loop, the value of the next element of the array 
is assigned to the foreach variable $element. The loop exits after it has reached the end of the array. 


There is one important point to make, however. If in the block you change the value 
of the loop variable $element, the array is changed, and the change stays in effect after you've left 
the foreach loop. For example: 


@array = ('one', 'two', 'three');


foreach $element (@array) {
    $element = 'four';
}


foreach $element (@array) {
    print $element,"\n";
}


produces the output:


four
four
four


In the do-until loop, the block is executed before the conditional test, and the loop
repeats until the condition is true:


$i = 3;
do {
    print $i,"\n";
    $i--;
} until ( $i );


This prints:


3


In the do-while loop, the block is executed before the conditional test, and the loop
repeats while the condition is true:


$i = 3;
do {
    print $i,"\n";
    $i--;
} while ( $i );


This prints:


3
2
1
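

The fact that a foreach loop variable stands for the array element itself can be put to good use. The sketch below, with sample data I made up, uppercases every sequence in place and then collects values from a second loop and from a while loop.

```perl
use strict;
use warnings;

my @sequences = ( 'acgt', 'ggca', 'ttaa' );

# The loop variable is the element itself, so this changes the array.
foreach my $sequence (@sequences) {
    $sequence = uc $sequence;
}

# for may be used in place of foreach; here it only reads the elements.
my @lengths;
for my $sequence (@sequences) {
    push @lengths, length $sequence;
}

# The countdown from this section, collected instead of printed.
my @countdown;
my $i = 3;
while ($i) {
    push @countdown, $i;
    $i--;
}

print "@sequences\n@lengths\n@countdown\n";
```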


B.14 Input/Output 


This section covers getting information into programs and receiving data back from 
them. 


B.14.1 Input from Files 


Perl has several convenient ways to get information into a program. In this book, I've 
emphasized opening files and reading in the information contained in them, because it is 
frequently used, and because it behaves very much the same way on all different 
operating systems. You've observed the open and close system calls and how to 
associate a filehandle with a file when you open it, which then is used to read in the data. 


As an example: 

open(FILEHANDLE, "informationfile");
@data_from_informationfile = <FILEHANDLE>;
close(FILEHANDLE);


This code opens the file informationfile and associates the filehandle FILEHANDLE 
with it. The filehandle is then used within angle brackets < > to actually read in the 
contents of the file and store the contents in the array @data_from_informationfile. 
Finally, the file is closed by referring once again to the opened filehandle. 
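

The example above doesn't check whether the open succeeded. Here is a sketch of a more defensive version. It uses the three-argument form of open with a filehandle stored in a scalar variable, a style common in newer Perl code but not the one used in this book; the File::Temp module is assumed only so that the sketch can create its own sample file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sample file so the sketch runs on its own.
my ( $out, $filename ) = tempfile( UNLINK => 1 );
print $out "ACGT\nGGCA\n";
close $out;

# Open for reading and stop with a message if the open fails.
open( my $in, '<', $filename ) or die "Cannot open file $filename: $!";
my @lines = <$in>;    # in list context, reads every line of the file
close $in;

chomp @lines;         # remove the newline at the end of each line

print scalar(@lines), " lines read; the first is $lines[0]\n";
```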






































B.14.2 Input from STDIN 


Perl allows you to read in any input that is automatically sent to your program via 
standard input (STDIN). STDIN is a filehandle that by default is always open. Your 
program may be expecting some input that way. For instance, on a Mac, you can 
drag and drop a file icon onto the Perl applet for your program to make the file's 
contents appear in STDIN. On Unix systems, you can pipe the output of some other 
program into the STDIN of your program with shell commands such as:


someprog | my_perl_program


You can also pipe the contents of a file into your program with:


cat file | my_perl_program


or with:


my_perl_program < file


Your program can then read in the data (from program or file) that comes as STDIN
just as if it came from a file that you've opened:


@data_from_stdin = <STDIN>;


B.14.3 Input from Files Named on the Command Line 


You can name your input files on the command line. <> is shorthand for <ARGV>. The
ARGV filehandle treats the array @ARGV as a list of filenames and returns the contents of all
those files, one line at a time. Perl places all command-line arguments into the array
@ARGV. Some of these may be special flags, which should be read and removed from
@ARGV if there will also be datafiles named. Perl assumes that anything in @ARGV refers to
an input filename when it reaches a <> command. The contents of the file or files are
then available to the program using the angle brackets <> without a filehandle, like so:


@data_from_files = <>;


For example, on Microsoft Windows, Unix, or Mac OS X, you specify input files at the
command line, like so:


% my_program file1 file2 file3








B.14.4 Output Commands 


The print statement is the most common way to output data from a Perl program. The
print statement takes as arguments a list of scalars separated by commas. An array can
be an argument, in which case, the elements of the array are all printed one after the other:


@array = ('DNA', 'RNA', 'Protein');
print @array;


This prints out:


DNARNAProtein


If you want to put spaces between the elements of an array, place it between double
quotes in the print statement, like this:


@array = ('DNA', 'RNA', 'Protein');
print "@array";


This prints out:


DNA RNA Protein


The print statement can specify a filehandle as an optional indirect object between the
print statement and the arguments, like so:


print FH "@array";


The printf function gives more control over the formatting of the output of numbers. For instance, you 
can specify field widths; the precision, or number of places after the decimal point; and whether the value is 
right- or left-justified in the field. I showed the most common options in Chapter 12 and refer you to the 
Perl documentation that comes with your copy of Perl for all the details. 


The sprintf function is related to the printf function; it formats a string instead of printing it out. 


The format and write commands are a way to format a multiline output, as when generating reports. 
format can be a useful command, but in practice it isn't used much. The full details are available in your 
Perl documentation, and O'Reilly's Programming Perl contains an entire chapter on format. You can 
also see format in Chapter 12 of this book. 
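

As a reminder of what the formatting options look like, here is a small sprintf sketch. The gene name and the number are invented for the example.

```perl
use strict;
use warnings;

my $gc_fraction = 0.4567;    # hypothetical value

# %-10s  a string, left-justified in a field 10 characters wide
# %6.2f  a number in a field 6 wide with 2 places after the decimal point
# %%     a literal percent sign
my $report = sprintf( "%-10s %6.2f%%", 'gene1', $gc_fraction * 100 );

# %05d   an integer padded with leading zeros to 5 digits
my $padded = sprintf( "%05d", 42 );

print "$report\n$padded\n";
```

Because sprintf returns the formatted string instead of printing it, the result can be stored, compared, or passed along before it is ever output.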


B.14.4.1 Output to STDOUT, STDERR, and Files 


Standard output, with the filehandle STDOUT, is the default destination for output from a 
Perl program, so it doesn't have to be named. The following two statements are 
equivalent unless you used select to change the default output filehandle: 


print "Hello biology world!\n";
print STDOUT "Hello biology world!\n";


Note that the STDOUT isn't followed by a comma. STDOUT is usually directed to the
computer screen, but it may be redirected at the command line to other programs or files.
This Unix command pipes the STDOUT of my_program to the STDIN of your_program:


my_program | your_program


This Unix command directs the output of my_program to the file outputfile:


my_program > outputfile


It's also common to direct certain error messages to the predefined standard error
filehandle STDERR or to a file you've opened for output and named with a particular
filehandle. Here are examples of these two tasks:


print STDERR "If you reached this part of the program, something is terribly wrong!";


open(OUTPUTFD, ">output_file");
print OUTPUTFD "Here is the first line in the output file output_file\n";


STDERR is also usually directed to the computer screen by default, but it can be directed 
into a file from the command line. This is done differently for different systems, for 
example, as follows (on Unix with the sh or bash shells): 

myprogram 2>myprogram.error 





You can also direct STDERR to a file from within your Perl program by including code 
such as the following before the first output to STDERR. This is the most portable 
way to redirect STDERR: 





open(STDERR, ">myprogram.error")
    or die "Cannot open error file myprogram.error:$!\n";


The problem with this is that the original STDERR is lost. This method, taken from 
Programming Perl, saves and restores the original STDERR: 


open ERRORFILE, ">myprogram.error"
    or die "Can't open myprogram.error";
open SAVEERR, ">&STDERR";
open STDERR, ">&ERRORFILE";


print STDERR "This will appear in error file myprogram.error\n";


# now, restore STDERR
close STDERR;
open STDERR, ">&SAVEERR";


print STDERR "This will appear on the computer screen\n";


There are a lot of details concerning filehandles not covered in this book, and 
redirecting one of the predefined filehandles such as STDERR can cause problems, 
especially as your programs get bigger and rely more on modules and libraries of 
subroutines. One safe way is to define a new filehandle associated with an error file 
and to send all your error messages to it: 





open(ERRORMESSAGES, ">myprogram.error")
    or die "Cannot open myprogram.error:$!\n";


print ERRORMESSAGES "This is an error message\n";


Note that the die function, and the closely related warn function, print their error 
messages to STDERR. 





B.15 Regular Expressions 


Regular expressions are, in effect, an extra language that lives inside the Perl 
language. In Perl, they have quite a lot of features. First, I'll summarize how regular 
expressions work in Perl; then, I'll present some of their many features. 


B.15.1 Overview 


Regular expressions describe patterns in strings. The pattern described by a single 
regular expression may match many different strings. 


Regular expressions are used in pattern matching, that is, when you look to see if a certain pattern exists in 
a string. They can also change strings, as with the s/// operator that substitutes the pattern, if found, for a 
replacement. Additionally, they are used in the tr function that can transliterate several characters into 
replacement characters throughout a string. Regular expressions are case-sensitive, unless explicitly told 
otherwise. 


The simplest pattern match is a string that matches itself. For instance, to see if the pattern 'abc' appears
in the string 'abcdefghijklmnopqrstuvwxyz', write the following in Perl:


$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if ( $alphabet =~ /abc/ ) {
    print $&;
}


The =~ operator binds a pattern match to a string. /abc/ is the pattern abc, enclosed in forward slashes
// to indicate that it's a regular-expression pattern. $& is set to the matched pattern, if any. In this case, the
match succeeds, since 'abc' appears in the string $alphabet, and the code just given prints out abc.


Regular expressions are made from two kinds of characters. Many characters, such as 'a' or 'Z', match 
themselves. Metacharacters have a special meaning in the regular-expression language. For instance, 
parentheses ( ) are used to group other characters and don't match themselves. If you want to match a 
metacharacter such as ( in a string, you have to precede it with the backslash metacharacter \ ( in the 
pattern. 


There are three basic ideas behind regular expressions. The first is concatenation:
two items next to each other in a regular-expression pattern (that's the string
between the forward slashes // in the examples) must match two items next to each other in the
string being matched (the $alphabet in the examples). So to match 'abc' followed by 'def',
concatenate them in the regular expression:


$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if ( $alphabet =~ /abcdef/ ) {
    print $&;
}


This prints:


abcdef


The second major idea is alternation. Items separated by the | metacharacter match any
one of the items. For example:


$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if ( $alphabet =~ /a(b|c|d)c/ ) {
    print $&;
}


prints:


abc


The example also shows how parentheses group things in a regular expression. The parentheses are
metacharacters that aren't matched in the string; rather, they group the alternation, given as b|c|d,
meaning any one of b, c, or d at that position in the pattern. Since b is actually in $alphabet at that
position, the alternation, and indeed the entire pattern a(b|c|d)c, matches in the $alphabet. (One
additional point: ab|cd means (ab)|(cd), not a(b|c)d.)
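

The point about grouping is worth trying out. In this sketch the test strings 'xxcdxx' and 'acd' are my own, chosen so that the grouped and ungrouped patterns give different answers.

```perl
use strict;
use warnings;

my $alphabet = 'abcdefghijklmnopqrstuvwxyz';

# Concatenation plus a grouped alternation, as in the example above.
my $matched = '';
if ( $alphabet =~ /a(b|c|d)c/ ) {
    $matched = $&;    # the part of the string that matched: 'abc'
}

# Without parentheses, the alternation splits the whole pattern: ab or cd.
my $ungrouped = ( 'xxcdxx' =~ /ab|cd/ ) ? 1 : 0;      # 1, because cd is present

# With parentheses, only the middle character varies: abd or acd.
my $grouped_miss = ( 'xxcdxx' =~ /a(b|c)d/ ) ? 1 : 0; # 0, there is no leading a
my $grouped_hit  = ( 'acd' =~ /a(b|c)d/ ) ? 1 : 0;    # 1

print "$matched $ungrouped $grouped_miss $grouped_hit\n";
```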


The third major idea of regular expressions is repetition (or closure). This is indicated in a pattern with the
quantifier metacharacter *, sometimes called the Kleene star after one of the inventors of regular
expressions. When * appears after an item, it means that the item may appear 0, 1, or any number of times
at that place in the string. So, for example, all of the following pattern matches will succeed:


'AC' =~ /AB*C/;
'ABC' =~ /AB*C/;
'ABBBBBBBBBBBC' =~ /AB*C/;

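Here's a small sketch of how the three ideas combine in a single pattern. The DNA string and the variable names are invented for illustration only:

```perl
use strict;
use warnings;

# Hypothetical DNA fragment, invented for this illustration
my $dna = 'CCGATTTTGACC';

# Concatenation: G followed by A
# Repetition:    zero or more T's (T*)
# Alternation:   either GA or CA
my $matched = '';
if ( $dna =~ /GAT*(GA|CA)/ ) {
    $matched = $&;
}
print "$matched\n";
```

With this input the pattern matches GA, then four T's, then GA, so the sketch prints GATTTTGA.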

B.15.2 Metacharacters 


The following are metacharacters: 


\ | ( ) [ { ^ $ * + ? . 


B.15.2.1 Escaping with \ 


A backslash \ before a metacharacter causes it to match itself; for instance, \\ matches a 
single \ in the string. 


B.15.2.2 Alternation with | 

The pipe | indicates alternation, as described previously. 


B.15.2.3 Grouping with ( ) 


The parentheses ( ) provide grouping, as described previously. 


B.15.2.4 Character classes 


Square brackets [ ] specify a character class. A character class matches one character, 
which can be any character specified. For instance, [abc] matches either a, or b, or c at 
that position (so it's the same as a|b|c). A-Z is a range that matches any uppercase letter, 
a-z matches any lowercase letter, and 0-9 matches any digit. For instance, [A-Za-z0-9] 
matches any single letter or digit at that position. If the first character in a character class 
is ^, any character except those specified matches; for instance, [^0-9] matches any 
character that isn't a digit. 
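
As a rough illustration of character classes applied to sequence data, the following sketch checks a string for characters outside ACGT and collects the purines. The sequence and variable names are assumptions made for the example, and it uses the /g modifier (described in Section B.15.6) to collect every match:

```perl
use strict;
use warnings;

# Hypothetical sequence containing an ambiguity code (N)
my $seq = 'ACGTNACGT';

# [^ACGT] matches any single character that is not A, C, G, or T
my $has_ambiguous = ( $seq =~ /[^ACGT]/ ) ? 1 : 0;

# [AG] matches one purine; with /g in list context, every match is returned
my @purines = ( $seq =~ /[AG]/g );

print "ambiguous: $has_ambiguous\n";
print "purines found: ", scalar(@purines), "\n";
```

For this input the sketch reports one ambiguous character class hit (ambiguous: 1) and four purines.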


B.15.2.5 Matching any character with . 


The period or dot . represents any character except a newline. (The pattern modifier /s 
makes it also match a newline.) So, . is like a character class that specifies every 
character. 


B.15.2.6 Beginning and end of strings with ^ and $ 


The ^ metacharacter doesn't match a character; rather, it asserts that the item that follows 
must be at the beginning of the string. Similarly, the $ metacharacter doesn't match a 
character but asserts that the item that precedes it must be at the end of the string (or 
before the final newline). For example: /^Watson and Crick/ matches if the string 
starts with Watson and Crick; and /Watson and Crick$/ matches if the string ends 
with Watson and Crick or Watson and Crick\n. 
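
The anchors can be tried out with a short sketch like the one below. The input line is an invented example that still carries its trailing newline, as a line read from a file typically would:

```perl
use strict;
use warnings;

# Hypothetical line of input, still carrying its trailing newline
my $line = "Watson and Crick\n";

my $starts = ( $line =~ /^Watson/ ) ? 'yes' : 'no';
my $ends   = ( $line =~ /Crick$/ )  ? 'yes' : 'no';   # $ allows a final newline
my $middle = ( $line =~ /^and/ )    ? 'yes' : 'no';   # 'and' is not at the start

print "starts=$starts ends=$ends middle=$middle\n";
```

This prints starts=yes ends=yes middle=no, since 'and' appears in the string but not at its beginning.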


B.15.2.7 Quantifiers: * + {MIN,} {MIN,MAX} ? 


These metacharacters indicate the repetition of an item. The * metacharacter indicates 
zero, one, or more of the preceding item. The + metacharacter indicates one or more of 
the preceding item. The brace { } metacharacters let you specify the exact number of 
preceding items, or a range. For instance, {3} means exactly three of the preceding item; 
{3,7} means three, four, five, six, or seven of the preceding item; and {3,} means three 
or more of the preceding item. The ? matches none or one of the preceding item. 
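
To see the quantifiers side by side, here is a minimal sketch that tests three invented sequences against patterns using {2}, {2,4}, and ?. The sequences and variable names are assumptions for the example:

```perl
use strict;
use warnings;

# Hypothetical sequences, invented for this illustration
my @seqs = ( 'GCAAT', 'GCAAAAT', 'GCT' );

my @results;
foreach my $seq (@seqs) {
    my $exact = ( $seq =~ /^GCA{2}T$/ )   ? 1 : 0;   # exactly two A's
    my $range = ( $seq =~ /^GCA{2,4}T$/ ) ? 1 : 0;   # two to four A's
    my $opt   = ( $seq =~ /^GCA?T$/ )     ? 1 : 0;   # zero or one A
    push @results, "$seq: exact=$exact range=$range optional=$opt";
}
print "$_\n" foreach @results;
```

The anchors ^ and $ are included so that each quantifier has to account for every A in the string; without them, a pattern such as /GCA{2}/ would also succeed on a string with four A's.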
B.15.2.8 Making quantifiers match minimally with ? 


The quantifiers just shown are greedy (or maximal) by default, meaning that they match 
as many items as possible. Sometimes, you want a minimal match that will match as few 
items as possible. You get that by following each of * + {} ? with a ?. So, for instance, *? 
tries to match as few as possible, perhaps even none, of the preceding item before it tries 
to match one or more of the preceding item. Here's a maximal match: 


'hear ye hear ye hear ye' =~ /hear.*ye/; 
print $&; 

This matches 'hear' followed by .* (as many characters as possible), followed by 'ye', 
and prints: 

hear ye hear ye hear ye 





Here is a minimal match: 


'hear ye hear ye hear ye' =~ /hear.*?ye/; 
print $&; 

This matches 'hear' followed by .*? (the fewest number of characters possible), 
followed by 'ye', and prints: 

hear ye 





B.15.3 Capturing Matched Patterns 


You can place parentheses around parts of the pattern for which you want to know 
the matched string. For example: 


$alphabet = 'abcdefghijklmnopqrstuvwxyz'; 
$alphabet =~ /k(lmnop)q/; 
print $1; 

prints: 

lmnop 


You can place as many pairs of parentheses in a regular expression as you like; Perl 
automatically stores their matched substrings in special variables named $1, $2, and so on. 
The matches are numbered in order of the left-to-right appearance of their opening 
parenthesis. 


Here's a more intricate example of capturing parts of a matched pattern in a string: 


$alphabet = 'abcdefghijklmnopqrstuvwxyz'; 
$alphabet =~ /(((a)b)c)/; 
print "First pattern = ", $1, "\n"; 
print "Second pattern = ", $2, "\n"; 
print "Third pattern = ", $3, "\n"; 

This prints: 


First pattern = abc 
Second pattern = ab 
Third pattern = a 
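
In practice it's wise to check that the match succeeded before using $1, $2, and so on. The sketch below shows one way to do that; the header line, its format, and the variable names are invented for the example:

```perl
use strict;
use warnings;

# Hypothetical FASTA-style header line, invented for this illustration
my $header = '>gi|12345|sample sequence';

my ( $id, $description ) = ( '', '' );

# Test the match first: after a failed match, $1 and $2 keep whatever
# values they had from an earlier successful match
if ( $header =~ /^>gi\|([0-9]+)\|(.*)$/ ) {
    ( $id, $description ) = ( $1, $2 );
}
print "id=$id description=$description\n";
```

Note that the | characters in the header have to be escaped as \| in the pattern, since an unescaped | means alternation. The sketch prints id=12345 description=sample sequence.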
B.15.4 Metasymbols 


Metasymbols are sequences of two or more characters consisting of backslashes before 
normal characters. These metasymbols have special meanings in Perl regular expressions 
(and in double-quoted strings for most of them). There are quite a few of them, but that's 
because they're so useful. Table B-3 lists most of these metasymbols. The column 
"Atomic" indicates Yes if the metasymbol matches an item, No if the metasymbol just 
makes an assertion, and - if it takes some other action. 

Table B-3. Alphanumeric metasymbols 





Symbol      Atomic   Meaning
\0          Yes      Match the null character (ASCII NUL)
\NNN        Yes      Match the character given in octal, up to 377
\n          Yes      Match nth previously captured string (decimal)
\a          Yes      Match the alarm character (BEL)
\A          No       True at the beginning of a string
\b          Yes      Match the backspace character (BS)
\b          No       True at word boundary
\B          No       True when not at word boundary
\cX         Yes      Match the control character Control-X
\d          Yes      Match any digit character
\D          Yes      Match any nondigit character
\e          Yes      Match the escape character (ASCII ESC, not backslash)
\E          -        End case (\L, \U) or metaquote (\Q) translation
\f          Yes      Match the formfeed character (FF)
\G          No       True at end-of-match position of prior m//g
\l          -        Lowercase the next character only
\L          -        Lowercase till \E
\n          Yes      Match the newline character (usually NL, but CR on Macs)
\Q          -        Quote (de-meta) metacharacters till \E
\r          Yes      Match the return character (usually CR, but NL on Macs)
\s          Yes      Match any whitespace character
\S          Yes      Match any nonwhitespace character
\t          Yes      Match the tab character (HT)
\u          -        Titlecase the next character only
\U          -        Uppercase (not titlecase) till \E
\w          Yes      Match any "word" character (alphanumerics plus _)
\W          Yes      Match any nonword character
\x{abcd}    Yes      Match the character given in hexadecimal
\z          No       True at end of string only
\Z          No       True at end of string or before optional newline
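
A few of these metasymbols turn up constantly when pulling fields out of text. The following sketch uses \b, \s, \d, and \S on an invented annotation line; the line format and variable names are assumptions for the example:

```perl
use strict;
use warnings;

# Hypothetical annotation line, invented for this illustration
my $line = "exon 3\tstart=1042  end=1187";

# \b requires a word boundary before 'end'; \s* allows optional
# whitespace around '='; (\d+) captures a run of digits
my $end = '';
if ( $line =~ /\bend\s*=\s*(\d+)/ ) {
    $end = $1;
}

# ^(\S+) captures the first run of nonwhitespace characters;
# in list context the match returns its captured substrings
my ($first_word) = $line =~ /^(\S+)/;

print "end=$end first=$first_word\n";
```

This prints end=1187 first=exon.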








B.15.5 Extending Regular-Expression Sequences 


Table B-4 includes several useful features that have been added to Perl's 
regular-expression capabilities. 

Table B-4. Extended regular-expression sequences 


Extension           Atomic   Meaning
(?#...)             No       Comment, discard
(?:...)             Yes      Cluster-only parentheses, no capturing
(?imsx-imsx)        No       Enable/disable pattern modifiers
(?imsx-imsx:...)    Yes      Cluster-only parentheses plus modifiers
(?=...)             No       True if lookahead assertion succeeds
(?!...)             No       True if lookahead assertion fails
(?<=...)            No       True if lookbehind assertion succeeds
(?<!...)            No       True if lookbehind assertion fails
(?>...)             Yes      Match nonbacktracking subpattern
(?{...})            No       Execute embedded Perl code
(??{...})           Yes      Match regex from embedded Perl code
(?(...)...|...)     Yes      Match with if-then-else pattern
(?(...)...)         Yes      Match with if-then pattern
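
Two of these extensions, cluster-only parentheses and the lookahead assertion, are illustrated in the sketch below. The DNA string and variable names are invented for the example; it also relies on the /g modifier and the pos function, both described elsewhere in this appendix:

```perl
use strict;
use warnings;

# Hypothetical DNA fragment, invented for this illustration
my $dna = 'ATGAAATGCCC';

# (?:...) groups without capturing, so $1 is the codon after ATG
my $after_start = '';
if ( $dna =~ /(?:ATG)([ACGT]{3})/ ) {
    $after_start = $1;
}

# (?=...) tests what follows without consuming any characters,
# so pos() reports the offset at which each ATG begins
my @positions;
while ( $dna =~ /(?=ATG)/g ) {
    push @positions, pos($dna);
}

print "codon after start: $after_start\n";
print "ATG found at offsets: @positions\n";
```

For this input the codon after the first ATG is AAA, and ATG is found at offsets 0 and 5.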








B.15.6 Pattern Modifiers 


Pattern modifiers are single-letter commands placed after the forward slashes that 
delimit a regular expression or a substitution; they change the behavior of some 
regular-expression features. Table B-5 lists the most common pattern modifiers, followed 
by an example. 

Table B-5. Pattern modifiers 


Modifier   Meaning
/i         Ignore upper- or lowercase distinctions
/s         Let . match newline
/m         Let ^ and $ match next to embedded \n
/x         Ignore (most) whitespace and permit comments in patterns
/o         Compile pattern once only
/g         Find all matches, not just the first one





As an example, say you were looking for a name in text, but you didn't know if the name 
had an initial capital letter or was all capitalized. You can use the /i modifier, like so: 


$text = "WATSON and CRICK won the Nobel Prize"; 
$text =~ /Watson/i; 
print $&; 


This matches (since /i causes upper- and lowercase distinctions to be ignored) and prints 
out the matched string WATSON. 
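
Modifiers can be combined. As a further sketch, the following tries /g, /i, /m, and /x on an invented two-line string; the text and variable names are assumptions for the example:

```perl
use strict;
use warnings;

# Hypothetical two-line text, invented for this illustration
my $text = "Watson and Crick\nwatson again\n";

# /g in list context returns all matches; /i ignores case
my @names = ( $text =~ /watson/gi );

# /m lets ^ match at the start of each line, not just the string
my @line_starts = ( $text =~ /^(\w+)/gm );

# /x ignores the whitespace used to space out the pattern
my $found_pair = ( $text =~ / Watson \s+ and \s+ Crick /x ) ? 1 : 0;

print scalar(@names), " matches: @names\n";
print "line starts: @line_starts\n";
print "pair found: $found_pair\n";
```

Here /watson/gi finds both Watson and watson, and /^(\w+)/gm captures the first word of each of the two lines.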


B.16 Scalar and List Context 


Every operation in Perl is evaluated in either scalar or list context. Many operators 
behave differently depending on the context they are in, returning a list in list 
context and a scalar in scalar context. 


The simplest example of scalar and list contexts is the assignment statement. If the left 
side (the variable being assigned a value) is a scalar variable, the right side (the values 
being assigned) is evaluated in scalar context. In the following examples, the right side 
is an array @array of two elements. When the left side is a scalar variable, it causes 
@array to be evaluated in scalar context. In scalar context, an array returns the number of 
elements in the array: 


@array = ('one', 'two'); 
$a = @array; 
print $a; 

This prints: 

2 


If you put parentheses around the $a, you make it a list with one element, which causes 
@array to be evaluated in list context: 


@array = ('one', 'two'); 
($a) = @array; 
print $a; 

This prints: 

one 


Notice that when assigning to a list, if there are not enough variables for all the 
values, the extra values are simply discarded. To capture all the values, you'd do 
this: 


@array = ('one', 'two'); 
($a, $b) = @array; 
print "$a $b"; 

This prints: 

one two 

Similarly, if you have too many variables on the left for the number of values on the right, the 
extra variables are assigned the undefined value undef. 


When reading about Perl functions and operations, notice what the documentation 
has to say about scalar and list context. Very often, if your program is behaving 
strangely, it's because it is evaluating in a different context than you had thought. 


Here are some general guidelines on when to expect scalar or list context: 


You get list context from function calls (anything in the argument position is evaluated in list context) and 
from list assignments. 


You get scalar context from string and number operators (arguments to such operators as . and + are 
assumed to be scalars); from boolean tests such as the conditional of an if () statement or the arguments 
to the || logical operator; and from scalar assignment. 
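
These guidelines can be checked with a short sketch such as the following. The array contents and variable names are invented for the example; the scalar function it uses is listed in Table B-6:

```perl
use strict;
use warnings;

my @bases = ( 'A', 'C', 'G', 'T' );

# Scalar assignment: the array yields its element count
my $count = @bases;

# List assignment: the first element is copied, the rest discarded
my ($first) = @bases;

# A numeric operator such as + imposes scalar context on its operands
my $plus_one = @bases + 1;

print "count=$count first=$first plus_one=$plus_one\n";

# Function arguments are in list context, so scalar() is used here
# to pass the count rather than the four elements
print "in an argument list: ", scalar(@bases), " elements\n";
```

The first line printed is count=4 first=A plus_one=5. Without scalar() in the last statement, print would receive the four elements themselves.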


B.17 Subroutines and Modules 


Subroutines are defined by the keyword sub, followed by the name of the subroutine, 
followed by a block enclosed by curly braces { } containing the body of the subroutine. 


Here's a simple example: 

sub a_subroutine { 
    print "I'm in a subroutine\n"; 
} 


In general, you can call subroutines using the name of the subroutine followed by a 
parenthesized list of arguments: 


a_subroutine(); 

Arguments can be passed into subroutines as a list of scalars. If any arrays are given as 
arguments, their elements are interpolated into the list of scalars. The subroutine receives 
all scalar values as a list in the special variable @_. This example illustrates a subroutine 
definition and the calling of the subroutine with some arguments: 

sub concatenate_dna { 
    my($dna1, $dna2) = @_; 
    my($concatenation); 
    $concatenation = "$dna1$dna2"; 
    return $concatenation; 
} 

print concatenate_dna('AAA', 'CGC'); 

This prints: 

AAACGC 

The arguments 'AAA' and 'CGC' are passed into the subroutine as a list of scalars. The 
first statement in the subroutine's block: 

my($dna1, $dna2) = @_; 

assigns this list, available in the special variable @_, to the variables $dna1 and $dna2. 

The variables $dna1 and $dna2 are declared as my variables to keep them local to the subroutine's block. 
In general, you declare all variables as my variables; this can be enforced by adding the statement use 
strict; near the beginning of your program. However, it is possible to use global variables that are not 
declared with my, which can be used anywhere in a program, including within subroutines. In this book, 
I've not used global variables. 


The statement: 

my($concatenation); 

declares another variable for use by the subroutine. 
After the statement: 

$concatenation = "$dna1$dna2"; 

performs the work of the subroutine, the subroutine defines its value with the return 
statement: 

return $concatenation; 

The value returned from a call to a subroutine can be used however you wish; in this example, it is given as 
the argument to the print function. 


If any arrays are given as arguments, their elements are interpolated into the @_ list, as in the following 
example: 


sub example_sub { 
    my(@arguments) = @_; 
    print "@arguments\n"; 
} 

my @array = ('two', 'three', 'four'); 
example_sub('one', @array, 'five'); 

which prints: 

one two three four five 


Note that the following attempt to mix arrays and scalars in the arguments to a 
subroutine won't work: 


# This won't work!! 
sub bad_sub { 
    my(@array, $scalar) = @_; 
    print $scalar; 
} 

my @arr = ('DNA', 'RNA'); 
my $string = 'Protein'; 

bad_sub(@arr, $string); 

In this example, the subroutine's variable @array on the left side in the assignment 
statement consumes the entire list on the right side in @_, namely ('DNA', 'RNA', 
'Protein'). The subroutine's variable $scalar won't be set, so the subroutine won't 
print 'Protein' as intended. To pass separate arrays and hashes to a subroutine, you 
need to use references; see Section 6.4.1 in Chapter 6. Here's a brief example: 

sub good_sub { 
    my($arrayref, $hashref) = @_; 
    print "@$arrayref", "\n"; 
    my @keys = keys %$hashref; 
    print "@keys", "\n"; 
} 

my @arr = ('DNA', 'RNA'); 
my %nums = ( 'one' => 1, 'two' => 2); 

good_sub(\@arr, \%nums); 

which prints (the order of the hash keys may vary): 

DNA RNA 
one two 
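
For the simpler case of one array and one scalar, there are two ways to repair bad_sub, sketched below. The subroutine names scalar_first and array_by_ref are hypothetical, chosen only for this illustration:

```perl
use strict;
use warnings;

# Hypothetical rewrite of bad_sub: the scalar comes first, so the
# array on the left side only absorbs what remains in @_
sub scalar_first {
    my ( $scalar, @array ) = @_;
    return "$scalar: @array";
}

# Alternative: pass the array by reference and keep the original order
sub array_by_ref {
    my ( $arrayref, $scalar ) = @_;
    return "$scalar: @$arrayref";
}

my @arr    = ( 'DNA', 'RNA' );
my $string = 'Protein';

my $first_result  = scalar_first( $string, @arr );
my $second_result = array_by_ref( \@arr, $string );
print "$first_result\n$second_result\n";
```

Both calls return Protein: DNA RNA. Putting the scalar first works only when there is a single array or hash, and it must come last; with two or more, references are needed.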


B.18 Built-in Functions 


Perl has a great many built-in functions. Table B-6 is a partial list with short descriptions. 


Table B-6. Perl built-in functions 


Function                                  Summary
abs VALUE                                 Return the absolute value of its numeric argument
atan2 Y, X                                Return the principal value of the arc tangent of Y/X from -π to π
chdir EXPR                                Change the working directory to EXPR (or home directory by
                                          default)
chmod MODE, LIST                          Change the file permissions of the LIST of files to MODE
chomp (VARIABLE or LIST)                  Remove ending newline from string(s), if present
chop (VARIABLE or LIST)                   Remove ending character from string(s)
chown UID, GID, LIST                      Change owner and group of LIST of files to numeric UID and GID
close FILEHANDLE                          Close the file, socket, or pipe associated with FILEHANDLE
closedir DIRHANDLE                        Close the directory associated with DIRHANDLE
cos EXPR                                  Return the cosine of the radian number EXPR
dbmclose HASH                             Break the binding between a DBM file and a hash
dbmopen HASH, DBNAME, MODE                Bind a DBM file to a HASH with permissions given in MODE
defined EXPR                              Return true or false if EXPR has a defined value or not
delete EXPR                               Delete an element (or slice) from a hash or an array
die LIST                                  Exit the program with an error message that includes LIST
each HASH                                 Step through a hash with one key, or key-value pair, at a time
exec PATHNAME LIST                        Terminate the program and execute the program PATHNAME
                                          with arguments LIST
exists EXPR                               Return true if hash key or array index exists
exit EXPR                                 Exit the program with the return value of EXPR
exp EXPR                                  Return the value of e raised to the exponent EXPR
format                                    Declare a format for use by the write function
grep EXPR, LIST                           Return list of elements of LIST for which EXPR is true
gmtime                                    Get Greenwich mean time; Sunday is day 0, January is month
                                          0, year is number of years since 1900. Example:
                                          ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,
                                          $isdaylightsavingstime) = gmtime;
goto LABEL                                Program control goes to statement marked with LABEL
hex EXPR                                  Return decimal value of hexadecimal EXPR
index STR, SUBSTR                         Give the position of the first occurrence of SUBSTR in STR
int EXPR                                  Give the integer portion of the number in EXPR
join EXPR, LIST                           Join the strings in LIST into a single string, separated by EXPR
keys HASH                                 Return a list of all the keys in HASH
last LABEL                                Exit the immediately enclosing loop by default, or loop with
                                          LABEL
lc EXPR                                   Return a lowercased copy of string in EXPR
lcfirst EXPR                              Return a copy of EXPR with first character lowercased
length EXPR                               Return the length in characters of EXPR
localtime                                 Get local time in same format as in gmtime function
log EXPR                                  Return natural logarithm of number EXPR
m/PATTERN/                                The match operator for the regular-expression PATTERN,
                                          often abbreviated as /PATTERN/
map BLOCK LIST (or map EXPR, LIST)        Evaluate BLOCK or EXPR for each element of LIST, return list
                                          of return values
mkdir FILENAME                            Create the directory FILENAME
my EXPR                                   Localize the variables in EXPR to the enclosing block
next LABEL                                Go to next iteration of enclosing loop by default or to loop
                                          marked with LABEL
oct EXPR                                  Return decimal value of octal value in EXPR
open FILEHANDLE, EXPR                     Open a file by associating FILEHANDLE with the file and
                                          options given in EXPR
opendir DIRHANDLE, EXPR                   Open the directory EXPR and assign handle DIRHANDLE
pop ARRAY                                 Remove and return the last element of ARRAY
pos SCALAR                                Give location in string SCALAR where last m//g search left off
print FILEHANDLE LIST                     Print LIST of strings to FILEHANDLE (default STDOUT)
printf FILEHANDLE FORMAT, LIST            Print string specified by FORMAT and variables LIST to
                                          FILEHANDLE
push ARRAY, LIST                          Place the elements of LIST at the end of ARRAY
rand EXPR                                 Give pseudorandom decimal number from 0 to less than EXPR
                                          (default 1)
readdir DIRHANDLE                         Return list of entries of directory DIRHANDLE
redo LABEL                                Restart a loop block without reevaluating the conditional
ref EXPR                                  Return true or false if EXPR is a reference or not: if true,
                                          returned value indicates type of reference
rename OLDNAME, NEWNAME                   Change the name of a file
return EXPR                               Return from the current subroutine with value EXPR
reverse LIST                              Give LIST in reverse order, or reverse strings in scalar context
rindex STR, SUBSTR                        Like the index function but returns last occurrence of SUBSTR
                                          in STR
rmdir FILENAME                            Delete the directory FILENAME
s/PATTERN/REPLACEMENT/                    Replace the match of regular-expression PATTERN with string
                                          REPLACEMENT
scalar EXPR                               Force EXPR to be evaluated in scalar context
seek FILEHANDLE, OFFSET, WHENCE           Position the file pointer for FILEHANDLE to OFFSET bytes (if
                                          WHENCE is 0, current position plus OFFSET if WHENCE is 1, or
                                          OFFSET bytes from the end if WHENCE is 2)
shift ARRAY                               Remove and return the first element of ARRAY
sin EXPR                                  Return the sine of the radian number EXPR
sleep EXPR                                Cause the program to sleep for EXPR seconds
sort USERSUB LIST (or sort BLOCK LIST)    Sort the LIST according to the order in USERSUB or BLOCK
                                          (default standard string order)
splice ARRAY, OFFSET, LENGTH, LIST        Remove LENGTH elements at OFFSET in ARRAY and replace
                                          with LIST, if present
split /PATTERN/, EXPR                     Split the string EXPR at occurrences of /PATTERN/, return list
sprintf FORMAT, LIST                      Return a string formatted as in the printf function
sqrt EXPR                                 Return the square root of the number EXPR
srand EXPR                                Set random number seed for rand operator; only needed in
                                          versions of Perl before 5.004
stat (FILEHANDLE or EXPR)                 Return statistics on file EXPR or its FILEHANDLE. Example:
                                          ($dev,$inode,$mode,$num_of_links,$uid,$gid,$rdev,$size,
                                          $accesstime,$modifiedtime,$changetime,$blksize,$blocks)
                                          = stat $filename;
study SCALAR                              Try to optimize subsequent pattern matches on string SCALAR
sub NAME BLOCK                            Define a subroutine named NAME with program code in
                                          BLOCK
substr EXPR, OFFSET, LENGTH, REPLACEMENT  Return substring of string EXPR at position OFFSET and length
                                          LENGTH; the substring is replaced with REPLACEMENT if used
system PATHNAME LIST                      Execute any program PATHNAME with arguments LIST;
                                          returns exit status of program, not its output; to capture
                                          output, use backticks. Example:
                                          @output = `/bin/who`;
tell FILEHANDLE                           Return current file position in bytes in FILEHANDLE
tr/ORIGINAL/REPLACEMENT/                  Transliterates each character in ORIGINAL with corresponding
                                          character in REPLACEMENT
truncate (FILEHANDLE or EXPR), LENGTH     Shorten file EXPR or opened with FILEHANDLE to LENGTH
                                          bytes
uc EXPR                                   Return uppercased version of string EXPR
ucfirst EXPR                              Return string EXPR with first character capitalized
undef EXPR                                Return the undefined value; if a defined variable or subroutine
                                          EXPR is given, it's no longer defined; it can be assigned a
                                          value when you don't need to save the value
unlink LIST                               Delete the LIST of files
unshift ARRAY, LIST                       Add LIST elements to the beginning of ARRAY
use MODULE                                Load the MODULE
values HASH                               Return a list of all values of the HASH
wantarray                                 In a subroutine, return true if calling program expects a list
                                          return value
warn LIST                                 Print error message including LIST
write FILEHANDLE                          Write formatted record to FILEHANDLE (default STDOUT) as
                                          defined by the format function
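
To give a feel for how several of these functions work together, here's a minimal sketch that uses reverse, tr, split, lc, uc, join, substr, index, and length on a short DNA string. The string and variable names are invented for the example:

```perl
use strict;
use warnings;

# Hypothetical DNA string, invented for this illustration
my $dna = 'ACGGTT';

# reverse in scalar context reverses the string; tr then swaps
# each base for its complement
my $revcom = reverse $dna;
$revcom =~ tr/ACGT/TGCA/;

# split on the empty pattern gives single characters; join and uc
# reassemble them with a separator, in uppercase
my @bases  = split //, lc $dna;
my $joined = uc join( '-', @bases );

# substr extracts a substring; index locates one (offsets start at 0)
my $codon = substr( $dna, 0, 3 );
my $where = index( $dna, 'GGT' );

print "$revcom\n$joined\n$codon $where ", length($dna), "\n";
```

For this input the reverse complement is AACCGT, the joined string is A-C-G-G-T-T, the first codon is ACG, GGT begins at offset 2, and the length is 6.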
Colophon 


Our look is the result of reader comments, our own experimentation, and feedback 
from distribution channels. Distinctive covers complement our distinctive approach to 
technical topics, breathing personality and life into potentially dry subjects. 


The animals on the cover of Beginning Perl for Bioinformatics are green frog (Rana 
clamitans) and American bullfrog (Rana catesbeiana) tadpoles. 


Tadpoles are the larvae of frogs and toads. They are aquatic and when first hatched 
have large, round heads and long, flat tails. Through a complex process of 
metamorphosis, tadpoles change from small fishlike creatures to the more familiar 
frogs and toads. This process can take from 10 days to 3 years depending on the 
species. 


During the first stages of metamorphosis, a tadpole's hind legs sprout, its head 
begins to flatten, and its tail becomes shorter. In its early life, a tadpole feeds 
primarily on diatoms, algae, and small quantities of zooplankton. As metamorphosis 
continues, it stops eating and begins to reabsorb its tail for sustenance while its 
digestive system changes from primarily vegetarian to carnivorous. During the final 
stages of metamorphosis, the tadpole's front legs appear, its jaws form, its skeleton 
hardens, and its gills disappear as the lungs develop. It soon begins to breathe air at 
the surface of the water. A short time later, the tadpole emerges from the water, 
reabsorbs the last of its tail, and hops off as a frog or a toad. 